## Abstract

Locating the propagation source is one of the most important strategies to control the harmful diffusion process on complex networks. Most existing methods only consider the infection time information of the observers, but the diffusion direction information of the observers is ignored, which is helpful to locate the source. In this paper, we consider both of the diffusion direction information and the infection time information to locate the source. We introduce a relaxed direction-induced search (DIS) to utilize the diffusion direction information of the observers to approximate the actual diffusion tree on a network. Based on the relaxed DIS, we further utilize the infection time information of the observers to define two kinds of observers-based similarity measures, including the Infection Time Similarity and the Infection Time Order Similarity. With the two kinds of similarity measures and the relaxed DIS, a novel source locating method is proposed. We validate the performance of the proposed method on a series of synthetic and real networks. The experimental results show that the proposed method is feasible and effective in accurately locating the propagation source.

## 1 Introduction

In the modern world, the ubiquity of the diffusion phenomena taking on networks has incurred huge losses to human society. Some typical examples include computer virus propagation (Wang et al. 2014a), disease spreading (Zhang et al. 2018) and rumor diffusion (Hosseini and Azgomi 2016), etc. It is of great theoretical and practical significance to develop effective strategies to control the harmful diffusion process (Yu et al. 2017). As one of the significant measures, propagation source locating has attracted widespread attentions, many effective methods are proposed in recent years (Jiang et al. 2017; Paluch et al. 2020). These methods can provide effective solutions for many important issues in reality, including locating the source(s) of SARS (Brockmann and Helbing 2013), COVID-19 (Tian et al. 2020), Cholera (Li et al. 2021), identifying the source of delay in public transportation networks (Manitz et al. 2017), estimating the source of foodborne disease (Horn and Friedrich 2019), etc.

It is well known that, when a diffusion process occurs on a network, there exists a spanning tree corresponding to the first time each node gets infected (Shah and Zaman 2011; Pinto et al. 2012; Tang et al. 2018). In fact, reconstructing the spanning tree is helpful to locate the propagation source (Yang et al. 2020). However, the commonly used breadth-first search (BFS) heuristic (Shah and Zaman 2011; Pinto et al. 2012; Yang et al. 2016) may be not an effective strategy (Tang et al. 2018; Yang et al. 2020). In this paper, we introduce an effective graph traversal method termed as relaxed direction-induced search (DIS), which is developed in our previous work (Yang et al. 2020). By utilizing the diffusion direction information of the observers, the relaxed DIS could effectively approximate the spanning tree corresponding to the first time each node gets infected. Based on the relaxed DIS, we further utilize the infection time information of the observers to define two kinds of observers-based similarity measures: (1) Infection Time Similarity, which measures the similarity between the observation infection time of the given observers and the measuring infection time of the given observers. (2) Infection Time Order Similarity, which measures the similarity between two sorted sequences of the given observers. One sequence is the observers ascending order obtained by sorting the observation infection time of the observers. Another sequence is the observers ascending order obtained by sorting the measuring infection time of the observers. Further, with the two kinds of similarity measures and the relaxed DIS, we propose a novel source locating method. Obviously, in the proposed method, both of the diffusion direction information and the infection time information are considered. Experiments are performed on a series of synthetic and real networks; the results show that the proposed method is feasible and effective in accurately locating the propagation source.

The rest of this paper is organized as follows. Existing related works are briefly reviewed in Sect. 2. We introduce the direction-induced search (DIS) in Sect. 3. Our method is proposed in Sect. 4. The performance of the proposed method is validated in Sect. 5. We conclude this work in Sect. 6.

## 2 Related work

For unweighted networks, a systematic method for propagation source locating was pioneered by Shah et al. (2011); they constructed a source estimator based on a novel topological quantity which is termed as Rumor Centrality (RC). Some researchers extended the RC to more complex environments, such as utilizing multiple observations to locate the source (Wang et al. 2014b), locating multi-sources (Luo et al. 2013; Wang et al. 2015) and so on. Zhu et al. (2016) developed a sample path-based method termed as Jordan Center (JC). Several improved methods based on the JC were developed to locate the source(s) with sparse observations (Zhu and Ying 2014; W.Luo et al. 2014; Jiang et al. 2018). Meanwhile, many source locating methods based on various ideas were developed for unweighted networks, including the Dynamic Message Passing-based method (Lokhov et al. 2014), the Belief Propagation base method (Altarelli et al. 2014), the Minimum Description Length-based method (Prakash et al. 2014), the Monte Carlo-based method (Antulov-Fantulin et al. 2015), the Rationality Observation-based method (Yang et al. 2016), the Time Aggregated Graph-based method (Chai et al. 2021), etc. The above methods are effective in unweighted networks. However, in reality, we have to consider various significant weights associated with the edges in networks, such as the traffic, the propagation delay and so on.

For weighted networks, Brockmann et al. (2013) modeled the Global Mobility Network as a weighted graph and proposed a source locating method based on a novel effective distance. This method is extended to more complex environments, including identifying the multiple sources (Jiang et al. 2015), identifying the source of delay in public transportation networks (Manitz et al. 2017), etc. However, the effective distance-based methods require the complete knowledge of nodes state. Meanwhile, there are several source locating methods based on various ideas for weighted networks (Cai et al. 2018; Chang et al. 2020; Feizi et al. 2019). But these methods also require the complete knowledge of nodes state. In reality, it is often the case that only limited nodes state can be observed (Caputo et al. 2019). To this problem, many methods were developed to locate the source with limited observers. Shen et al. (2016) developed a time-reversal backward spreading (TRBS) algorithm, but this algorithm may not work if the locatability condition is violated. Hu et al. (2019) proposed a greedy optimization algorithm to reduce the number of observers for TRBS. Tang and Ji et al. (2018) et al. proposed a source estimation algorithm based on the Gromov matrix. However, the Gromov matrix may be not the optimal heuristic for source locating. Meanwhile, Fu et al. (2016) proposed a backward diffusion-based method for multiple sources locating. Wang (2019) and Xu et al. (2019) identified the diffusion source based on the Spearman’s coefficient. Wang and Sun (2020) proposed a sequential neighbor filtering (SNF) algorithm for heterogeneous propagation models. Wang et al. (2021) proposed three source locating algorithms by defining the estimated mean and standard deviation of the propagation delay. However, the methods using limited observers only considered the infection time information, but the diffusion direction information was ignored.

The Gaussian estimator (Pinto et al. 2012) first located the source with limited observers by utilizing the diffusion direction information of the observers, and its time complexity can be reduced by ignoring the observers with low-quality information (Paluch et al. 2018). However, the diffusion direction information is only used in the tree graphs. In our previous work (Yang et al. 2020), a relaxed direction-induced search (DIS) was proposed by utilizing the diffusion direction information. With the relaxed DIS, the accuracy of the Gaussian estimator on general graphs is improved. Different from the previous work, in this paper, we first introduce the relaxed direction-induced search (DIS) (Yang et al. 2020) to utilize the diffusion direction information of the observers to approximate the actual diffusion tree on a network. Based on the relaxed DIS, we further utilize the infection time information of the observers to define two kinds of similarity measures, including the Infection Time Similarity and the Infection Time Order Similarity. With the two kinds of similarity measures and the relaxed DIS, we propose a novel source locating method. Obviously, the diffusion direction information and the infection time information are combined in this method. The feasibility and effectiveness of this method are validated on a series of synthetic and real networks.

## 3 Preliminaries

A network is modeled as an undirected and weighted graph \(\mathcal {G}=\left( \mathcal {V}, \mathcal {E}, {\varvec{\theta }}\right) \), where \(\mathcal {V}\) and \(\mathcal {E}\) represent the nodes set and edges set, respectively. \({\varvec{\theta }}=\left\{ \theta _{uv}\right\} \), where \(\theta _{uv}\) denotes the random propagation delay associated with an edge connecting nodes *u* and *v*, \(u, v\in \mathcal {V}\), \(vu\in \mathcal {E}\). The random variables \(\theta _{vu}\) for different edges *vu* have a known, arbitrary joint distribution.

*Diffusion model* Similar to the references (Zhu and Ying 2016; W.Luo et al. 2014; Lokhov et al. 2014; Yang et al. 2016), the diffusion process on \(\mathcal {G}\) is discrete. We adopt a simple Susceptible-Infectious (SI) model. With the SI model, each node in \({\mathcal {V}}\) is only in one of the two states: (1) susceptible, if it has not been infected so far, or (2) infectious, if it has been infected by any one neighbor. The diffusion process on \(\mathcal {G}\) is initiated by a single propagation source (denoted by \(s^*\)) at an unknown time \(t^*\). All nodes are susceptible except for \(s^*\) is infectious. A diffusion is possible from an infected node to a susceptible node if and only if there is an edge between them. Once infected, the node will stay the infectious state forever. Let \(\mathcal {N}\left( v\right) \) denote the neighbors set of node *v*, suppose *v* is infected by one neighbor *w* at time \(t_{v}\), then *v* will attempt to infect each susceptible neighbor \(u\in \mathcal {N}\left( v\right) \) (except for *w*) along the weighted edge *vu* with propagation ratio \(\beta \). If there are two or more infected neighbors having a same propagation delay to *u*, *u* can be first time infected by only one neighbor. Without loss of generality, the diffusion process is terminated when there are no susceptible nodes in \(\mathcal {G}\).

Let \(\mathcal {O}=\left\{ o_k\right\} ^{\mathcal {K}}_{k=1}\subseteq \mathcal {V}\) denote the set of \(\mathcal {K}\) observable nodes on \(\mathcal {G}\), termed as observers set, whose location in \(\mathcal {G}\) is known. Generally, there is \(\mathcal {K}\ll \vert \mathcal {V}\vert \). Similar to the references (Pinto et al. 2012; Yang et al. 2020), each \(o_k\in \mathcal {O}\) can provide two types of information: (1) the Diffusion Direction information in which the infection arrives to \(o_k\), (2) the Infection Timing information at which the infection arrives to \(o_k\).

*Direction-induced search (DIS)* We introduce a graph traversal method termed as relaxed direction-induced search (DIS), which is developed in our previous work (Yang et al. 2020). The relaxed DIS is summarized in Algorithm 1. The \(E\left( \mathcal {T}\right) \) declared in line 2 is an edge set. The function of lines 3–19 is to traverse the \(\mathcal {G}\) with *s* as root, which requires \(O\left( \vert \mathcal {V}\vert ^2\right) \) computations in the worst case. Here, from lines 10–16, we know that, if a node is an observer, then the infection direction is determined by the Diffusion Direction information recorded in this node. If the node is a non-observer, the infection direction will be assumed to be its current parent node. From lines 20–23, a DIS spanning tree \({\mathcal {T}}_{\textrm{dis},s}\) is generated if and only if “\(\vert E\left( \mathcal {T} \right) \vert ==\vert V\vert -1\)”, where line 20 requires \(O\left( \vert \mathcal {V}\vert +\vert \mathcal {E}\vert \right) \) computations. Finally, taking the loop in line 1 into account, the time complexity of Algorithm 1 is \(O\left( \vert \mathcal {V}\vert ^3\right) \). Further, by using Algorithm 1, a relaxed DIS spanning tree is generated by utilizing the Diffusion Direction information recorded in \(\mathcal {O}\).

Frequently used notations are summarized in Table 1.

## 4 The proposed method

Given an arbitrary \(\mathcal {G}\) and an arbitrary \(\mathcal {O}\), we locate the propagation source by measuring the similarity between the \(\mathcal {O}\) in the actual diffusion tree (corresponding to the first time each node gets infected) and the \(\mathcal {O}\) in a spanning tree of \(\mathcal {G}\), which can be described as an estimator that maximizes the similarity.

where \(\mathcal {T}_{s^*}\) denotes the actual diffusion tree with source \(s^*\) as root, and \(\mathcal {T}_s\) denotes a tree that spans all nodes in \(\mathcal {G}\) with node *s* as root. \(\mathcal {O}_{\mathcal {T}_{s^*}}\) and \(\mathcal {O}_{\mathcal {T}_s}\) denote the given \(\mathcal {O}\) in \(\mathcal {T}_{s^*}\) and \(\mathcal {T}_s\), respectively. \(\varvec{\mathcal {S}}\left( \mathcal {O}_{\mathcal {T}_{s^*}}, \mathcal {O}_{\mathcal {T}_s}\right) \) denotes the similarity between \(\mathcal {O}_{\mathcal {T}_{s^*}}\) and \(\mathcal {O}_{\mathcal {T}_s}\).

Theoretically, we have to evaluate \(\varvec{\mathcal {S}}\left( \mathcal {O}_{\mathcal {T}_{s^*}}, \mathcal {O}_{\mathcal {T}_s}\right) \) in Eq. 1 for all spanning trees of \(\mathcal {G}\) and then select the one with the maximal similarity and its root is the \(s^*\). However, the complexity to generate all spanning trees of \(\mathcal {G}\) will increase exponentially with the number of nodes. Therefore, we introduce an approximation by assuming that the actual diffusion tree is a relaxed DIS spanning tree (obtained by Algorithm 1), and the time complexity only requires \(O\left( \vert \mathcal {V}\vert ^3\right) \). Then, Eq. 1 can be modified as follows.

where \(\mathcal {T}_{{\text {DIS}},s}\) is a relaxed DIS spanning tree of \(\mathcal {G}\) with a node *s* as root. \(\mathcal {O}_{\mathcal {T}_{{\text {DIS}},s}}\) denote the given \(\mathcal {O}\) in \(\mathcal {T}_{{\text {DIS}},s}\). \(\varvec{\mathcal {S}}\left( \mathcal {O}_{\mathcal {T}_{s^*}}, \mathcal {O}_{\mathcal {T}_{{\text {DIS}},s}}\right) \) denotes the similarity between \(\mathcal {O}_{\mathcal {T}_{s^*}}\) and \(\mathcal {O}_{\mathcal {T}_{{\text {DIS}},s}}\).

Since \(\mathcal {K}<\vert \mathcal {V}\vert \), \(\mathcal {T}_{{\text {DIS}},s}\) may be not unique, and each \(\mathcal {T}_{{\text {DIS}},s}\) may not correspond to the actual diffusion tree. Thus, the relaxed DIS is obviously a sub-optimal heuristic.

### 4.1 Observers-based similarity measures

In this subsection, we first define two kinds of observers-based similarity measures by utilizing the infection time information of observers. One is Infection Time Similarity; another is Infection Time Order Similarity.

### Definition 1

Observation Infection Time. Given an arbitrary \(\mathcal {G}=\left( \mathcal {V},\mathcal {E},{\varvec{\theta }}\right) \) and an observers set \(\mathcal {O}=\left\{ o_k\right\} ^{\mathcal {K}}_{k=1}\). the Observation Infection Time of \(\mathcal {O}\) is defined as \({{\textbf {T}}}_{\mathcal {O}}=\left\{ t_{o_k}\right\} ^{\mathcal {K}}_{k=1}\), where \(t_{o_k}\) denotes the Infection Timing information recorded in \(o_k\).

### Definition 2

Measuring Infection Time. Given an arbitrary \(\mathcal {G}=\left( \mathcal {V},\mathcal {E},{\varvec{\theta }}\right) \) and an observers set \(\mathcal {O}=\left\{ o_k\right\} ^{\mathcal {K}}_{k=1}\). The Measuring Infection Time of \(\mathcal {O}\) in \({\mathcal {T}_{{\text {DIS}},s}}\) is defined as:

where \(\mathcal {T}_{{\text {DIS}},s}\) is a relaxed DIS spanning tree of \(\mathcal {G}\) with a node *s* as root, \(s\in \mathcal {V}\). \(t_{v_{o_k}}\) denotes the Measuring Infection Time of node \(v_{o_k}\); \(v_{o_k}\) is the node with the node number corresponding to the observer \(o_k\).

where \(t^*\) is the unknown start time. *s* is the root of \({\mathcal {T}_{{\text {DIS}},s}}\). \(p\left( s,v_{o_k}\right) \) denotes the path from *s* to \(v_{o_k}\) in the \({\mathcal {T}_{{\text {DIS}},s}}\). \(\theta _j\) denotes the propagation delay.

In fact, in the current fixed \({\mathcal {T}_{{\text {DIS}},s}}\) with *s* as root, the *s* is assumed to be the propagation source. Thus, \(t^*\) minimizes the difference between \({{\textbf {T}}}_{\mathcal {O}}\) and \({{\textbf {T}}}_{\mathcal {T}_{{\text {DIS}},s}}\); \(t^*\) can be estimated by the following function.

where \(t^*\in \left[ 0,z\right] \), \(z\le \sum \theta \), \(\theta \in {\varvec{\theta }}\). \(t_{v_{o_k}}\) denotes the Measuring Infection Time of \(v_{o_k}\). \(t_{o_k}\) denotes the Infection Timing information recorded of \(o_k\).

### Definition 3

Infection Time Similarity is defined as:

where \(\mathcal {D}\left( {{\textbf {T}}}_{\mathcal {O}},{{\textbf {T}}}_{\mathcal {T}_{{\text {DIS}},s}}\right) \) denotes the Euclidean distance (Rui and Wunsch 2005) between the \({{\textbf {T}}}_{\mathcal {O}}\) (Definition 1) and the \({{\textbf {T}}}_{\mathcal {T}_{{\text {DIS}},s}}\) (Definition 2). \(\varvec{\mathcal {S}}\left( {{\textbf {T}}}_{\mathcal {O}},{{\textbf {T}}}_{\mathcal {T}_{{\text {DIS}},s}}\right) \in \left( 0,1\right] \).

### Definition 4

Observation Infection Time Order. Given an arbitrary \(\mathcal {G}=\left( \mathcal {V},\mathcal {E},{\varvec{\theta }}\right) \) and an observers set \(\mathcal {O}=\left\{ o_k\right\} ^{\mathcal {K}}_{k=1}\). The Observation Time Order of \(\mathcal {O}\) is defined as an ordered observers sequence \({{\textbf {TO}}}_{\mathcal {O}}=\left\langle o_i\right\rangle ^\mathcal {K}_{i=1}\), in which each \(o_i\) is sorted by ascending order according to the Infection Timing information (denoted by \(t_{o_i}\)) recorded in \(o_i\). For any pair of observers \(o_i,o_{i+1}\in {{\textbf {TO}}}_{\mathcal {O}}\), there is \(t_{o_i}\le t_{o_{i+1}}\).

### Definition 5

Measuring Infection Time Order. Given an arbitrary \(\mathcal {G}=\left( \mathcal {V},\mathcal {E},{\varvec{\theta }}\right) \) and an observers set \(\mathcal {O}=\left\{ o_k\right\} ^{\mathcal {K}}_{k=1}\). The Measuring Infection Time Order of \(\mathcal {O}\) on \(\mathcal {T}_{{\text {DIS}},s}\) is defined as an ordered nodes sequence \({{\textbf {TO}}}_{\mathcal {T}_{{\text {DIS}},s}}=\left\langle v_{o_k}\right\rangle ^\mathcal {K}_{k=1}\), in which each \(v_{o_k}\in {{\textbf {TO}}}_{\mathcal {T}_{{\text {DIS}},s}}\) is sorted by ascending order according to the Measuring Infection Time \(t_{v_{o_k}}\) (Definition 2). For any pair of nodes \(v_{o_k},v_{o_{k+1}}\in {{\textbf {TO}}}_{\mathcal {T}_{{\text {DIS}},s}}\), there is \(t_{v_{o_k}}\le t_{v_{o_{k+1}}}\).

### Definition 6

Infection Time Order Similarity is defined as:

where \(\tau \) denotes the correlation coefficient defined in reference (Kendall 1938); the details can be found in Appendix A. \(\tau \) is mainly used to measure the concordance between the \({{\textbf {TO}}}_{\mathcal {O}}\) (Definition 4) and the \({{\textbf {TO}}}_{\mathcal {T}_{{\text {DIS}},s}}\) (Definition 5). \(\varvec{\mathcal {S}}\left( {{\textbf {TO}}}_{\mathcal {O}},{{\textbf {TO}}}_{\mathcal {T}_{{\text {DIS}},s}}\right) \in \left[ 0,1\right] \).

The properties related to the Infection Time Similarity (Definition 3) and the Infection Time Order Similarity (Definition 6) can be found in Appendix B.

*Example:* In Fig. 1, Fig. 1b shows a relaxed DIS spanning tree of the network shown in Fig. 1a. \(\mathcal {O}=\left\{ o_1, o_2, o_3\right\} \), \(o_1\), \(o_2\) and \(o_3\) correspond to nodes 2, 3 and 6, respectively. According to Definition 1, for \({{\textbf {T}}}_{\mathcal {O}}\), \(t_2=4\), \(t_3=7\), \(t_6=6\). When the current root is node 1, according to Eq. 5, \(t^*=1\). According to Definition 2, for \({{\textbf {T}}}_{\mathcal {T}_{{\text {DIS}},s}}\), \(t_2=4\), \(t_3=7\), \(t_6=6\). Then, according to Definition 3, we have \(\varvec{\mathcal {S}}\left( {{\textbf {T}}}_{\mathcal {O}},{{\textbf {T}}}_{\mathcal {T}_{{\text {DIS}},s}}\right) =1\). According to Definition 4, \({{\textbf {TO}}}_{\mathcal {O}}=\left\langle 2,6,3\right\rangle \). According to Definition 5, \({{\textbf {TO}}}_{\mathcal {T}_{{\text {DIS}},s}}=\left\langle 2,6,3\right\rangle \). Further, with Definition 6, we have \(\varvec{\mathcal {S}}\left( {{\textbf {TO}}}_{\mathcal {O}},{{\textbf {TO}}}_{\mathcal {T}_{{\text {DIS}},s}}\right) =1\).

### 4.2 Locating the propagation source

By combining the Infection Time Similarity (Definition 3) and the Infection Time Order Similarity (Definition 6), the source estimator in Eq. 2 can be written as follows:

where \(\varvec{\mathcal {S}}\left( {{\textbf {T}}}_{\mathcal {O}},{{\textbf {T}}}_{\mathcal {T}_{{\text {DIS}},s}}\right) \) and \(\varvec{\mathcal {S}}\left( {{\textbf {TO}}}_{\mathcal {O}},{{\textbf {TO}}}_{\mathcal {T}_{{\text {DIS}},s}}\right) \) are defined in Definition 3 and Definition 6, respectively.

Based on Eq. 8, we propose a novel source locating method, termed as OSDIS algorithm, which is summarized in Algorithm 2.

Algorithm 2 analysis: The \(E\left( \mathcal {T}\right) \) declared in line 2 is an edge set. Lines 3–19 are used for traversing the \(\mathcal {G}\) by the relaxed DIS with current node *s* as root and recording the eligible edges into \(E\left( \mathcal {T}\right) \). Lines 3–19 require \(O\left( \vert \mathcal {V}\vert ^2\right) \) computations in the worst case. Line 20 requires \(O\left( \vert \mathcal {V}\vert +\vert \mathcal {E}\vert \right) \) computations, which can be reduced to \(O\left( \vert \mathcal {E}\vert \right) \). In lines 21–22, the \(\mathcal {T}\) obtained in line 20 will be marked as a relaxed DIS spanning tree \(\mathcal {T}_{{\text {DIS}},s}\) if and only if \(\vert E\left( \mathcal {T}\right) \vert ==\vert \mathcal {V}\vert -1\). Line 23 requires \(O\left( \mathcal {K}\right) \) computations. Line 24 requires \(O\left( \vert \mathcal {V}\vert ^2+z\vert \mathcal {V}\vert \mathcal {K}\right) \) computations (*z* can be found in Eq. 5). Lines 25–26 require \(O\left( \vert \mathcal {V}\vert \right) \) and \(O\left( \mathcal {K}^2\right) \) computations, respectively. Both lines 27 and 28 require \(O\left( \mathcal {K}\log \mathcal {K}\right) \) computations. Line 29 requires \(O\left( \mathcal {K}^2\right) \) computations. Finally, each node \(s\in \mathcal {V}\) will be used as root to construct different \(\mathcal {T}_{{\text {DIS}},s}\). Thus, the time complexity of Algorithm 2 is \(O\left( \vert \mathcal {V}\vert ^3+z\vert \mathcal {V}\vert ^2\mathcal {K}\right) \).

## 5 Experimental evaluation

To validate the feasibility and effectiveness of the OSDIS algorithm, it is compared with other four state-of-the-art methods on a series of synthetic and real networks. The four methods include the Gauss algorithm (Pinto et al. 2012), GSSI algorithm (Tang et al. 2018), TRBS algorithm (Shen et al. 2016) and SNF algorithm (Wang and Sun 2020). Besides, since the OSDIS algorithm is based on the relaxed DIS heuristic, to show its advantage, we also define an algorithm, denoted by OSBFS, in which the relaxed DIS heuristic is replaced by the breadth-first search (BFS) heuristic. Totally, six algorithms are compared in the experiments. Their time complexity is shown in Table 2. Similar to the reference (Yang et al. 2020), the performance of a source locating algorithm is mainly evaluated by the precision (the precise locating ratio, i.e., the proportion of 0 error hop), the average error hop and the average error delay. For the precision, the higher the value is, the better the algorithm is. For the average error hop and the average error delay, the smaller the value is, the better the algorithm is.

*Running environment* Hardware: Dell R740 with 2 Intel\(^{\small {\textcircled {\tiny {R}}}}\)Xeon\(^{\small {\textcircled {\tiny {R}}}}\) gold 6254 CPU, 1T RAM. Software: Cygwin 3.0.7 + Eclipse Cpp2019 + igraph C 0.7.1 + Eigen/Dense (used for running algorithms). R 64 \(\times \) 3.3.3 + igraph R 1.2.1 (used for generating synthetic networks).

*Datasets* The six algorithms are evaluated on a series of synthetic and real networks. The synthetic networks include the scale-free (BA) model (Barabasi and Albert 1999) and the small-world (WS) model (Watts and Strogatz 1998). Totally, six BA models with different powers of the preferential attachment and five WS models with different rewiring probabilities are generated, respectively. The detailed parameters for generating these synthetic networks are shown in Table 3. The real networks are selected from different fields, which can be obtained from the Koblenz Network Collection (Kunegis 2013) and the Network Data Repository (Rossi and Ahmed 2015) for free. All the real networks are shown in Table 4. The topology properties of the used networks are shown in Table 5.

*Parameters setting* Given an arbitrary graph \(\mathcal {G}\), the propagation delays set \({\varvec{\theta }}\) are independent identically distributed (i.i.d) random variables with Gaussian distribution \(\mathcal {N}\left( \mu , \sigma ^2 \right) \), \(\mu \) and \(\sigma ^2\) are known (Pinto et al. 2012; Paluch et al. 2018). We set \(\mu /\sigma =4\). The diffusion model follows the one introduced in Sect. 3. To investigate the impact of different propagation ratios (denoted by \(\beta \)) on the performance of the source locating algorithms, we set \(\beta =0.25\), \(\beta =0.50\) and \(\beta =0.75\), respectively. Additionally, to compare with the GSSI algorithm (Tang et al. 2018), we set \(s^*\notin \mathcal {O}\). Generally, in reality, to save the cost, the number of the observers will be far less than the size of \(\mathcal {G}\). Thus, we randomly select \(5\%\) nodes as the observers in each network.

### 5.1 Experimental results on the synthetic networks

Figures 2, 3, 4, 5, 6, 7 show the precision (the precise locating ratio, i.e., the proportion of 0 error hop) of the six algorithms on a series of BA models. From Figs. 2, 3, 4, 5, 6, 7, we can see that, when \(\beta =0.25\), \(\beta =0.5\) and \(\beta =0.75\), the OSDIS algorithm generally exposes the best precision on all the six BA models, i.e., the OSDIS has a higher proportion in 0 error hop than other five algorithms. Only when \(\beta =0.75\), the OSDIS is inferior to the GSSI and TRBS on BA model (1), but outperforms other three algorithms. From Table 6, we know that, when \(\beta =0.25\), \(\beta =0.5\) and \(\beta =0.75\), the OSDIS is better than other five algorithms in the average error hop on all the six BA models. Only when \(\beta =0.5\), the OSDIS is a litter inferior to OSBFS on BA model (5), but superior to other four algorithms. Meanwhile, in Fig. 8, we plot interquartile range (IQR) to show the distribution regions of error hop of the six algorithms on the BA models. Additionally, from Table 7, we can see that, when \(\beta =0.25\), \(\beta =0.5\) and \(\beta =0.75\), the OSDIS exposes a better average error delay on all the six BA models. Only when \(\beta =0.75\), the OSDIS is inferior to TRBS on BA model (1), but outperforms other four algorithms. In summary, on the BA models, the OSDIS is generally better than other five algorithms in the precision, the average error hop and average error delay.

Figures 9, 10, 11, 12, 13 show the precision (the precise locating ratio, i.e., the proportion of 0 error hop) of the six algorithms on a series of WS models. From Figs. 9, 10, 11, 12, 13, we can see that, when \(\beta =0.25\), the OSDIS is superior to other five algorithms in the precision on WS models (1)–(4) (i.e., the OSDIS has a higher proportion in 0 error hop), but only inferior to TRBS and GSSI on WS model (5). When \(\beta =0.5\) and \(\beta =0.75\), the OSDIS is generally inferior to TRBS and GSSI in the precision on WS models (1)–(5), but superior to other three algorithms. Only when \(\beta =0.5\), the OSDIS exposes the best precision on WS model (5). From Tables 6 and 7, we know that, when \(\beta =0.25\) and \(\beta =0.5\), the OSDIS is generally better than other five algorithms in the average error hop and average error delay on all the five WS models. Only when \(\beta =0.5\), the OSDIS is inferior to TRBS on WS model (2). Meanwhile, from Tables 6 and 7, we know that, when \(\beta =0.75\), the OSDIS is always inferior to GSSI and TRBS in the average error hop and average error delay, but outperforms other three algorithms. In Fig. 14, we further plot interquartile range (IQR) to show the distribution regions of error hop of the six algorithms on the WS models. In summary, on the WS models, the OSDIS is generally superior to other five algorithms in the precision when \(\beta =0.25\) and generally exposes a better performance in the average error hop and average error delay when \(\beta =0.25\) and \(\beta =0.5\). Thus, the OSDIS is better than other five algorithms in most cases. Obviously, the performance of the OSDIS on the BA models is better than on the WS models.

### 5.2 Experimental results on the real networks

In this subsection, we further validate the performance of the six algorithms on the real networks. Figures 15, 16, 17, 18, 19, 20, 21 show the precision (the precise locating ratio, i.e., the proportion of 0 error hop) of the six algorithms on the real networks. When \(\beta =0.25\), from Figs. 15, 16, 17, 18, 19, 20, 21, we can see that the OSDIS generally exposes the best precision (i.e., the OSDIS has a higher proportion in 0 error hop) on all the real networks, except for Euroroads network on which the OSDIS is only inferior to GSSI, but superior to other four algorithms. By combining with Tables 6 and 7, we know that the OSDIS is also superior to other five algorithms in the average error hop and the average error delay on all the real networks, except for Euroroads network on which the OSDIS is only inferior to GSSI, but superior to other four algorithms. When \(\beta =0.5\), from Figs. 15, 16, 17, 18, 19, 20, 21, we know that the OSDIS exposes the best precision (i.e., the OSDIS has a higher proportion in 0 error hop) on Dolphins, Lesmis, USAirlines and Celegans networks. Meanwhile, the OSDIS is only inferior to GSSI in the precision on PDZBase, NetScience and Euroroads networks, but superior to other four algorithms. By combining with Tables 6 and 7, we can see that the OSDIS generally outperforms other five algorithms in the average error hop and the average error delay on the real networks. Only on USAirlines network, the OSDIS is inferior to GSSI and TRBS in the average error delay. When \(\beta =0.75\), from Figs. 15, 16, 17, 18, 19, 20, 21, we know that, except for Euroroads network, the OSDIS is generally inferior to TRBS or GSSI in the precision on the real networks. By combining with Tables 6 and 7, we can see that the OSDIS outperforms other five algorithms in the average error hop and average error delay on Dolphins, Lesmis, PDZBase, NetScience and Euroroads networks, but is inferior to GSSI and TRBS on USAirlines and Celegans networks. Meanwhile, in Appendix C Figs. 22, 23, 24, 25, 26, 27, 28, we plot interquartile range (IQR) to further show the distribution regions of error hop of the six algorithms on the real networks. In summary, on the real networks, the OSDIS is generally superior to other five algorithms in the precision, the average error hop and the average error delay when \(\beta =0.25\) and \(\beta =0.5\), but inferior to GSSI and TRBS when \(\beta =0.75\). Thus, the OSDIS is better than other five algorithms in most cases.

Overall, in the precision, the average error hop and average error delay, the OSDIS outperforms other five algorithms on the BA models and in most cases is superior to other five algorithms on the WS models and real networks. In a few cases, the OSDIS is only inferior to GSSI and TRBS, but superior to other three algorithms. Thus, the OSDIS is a feasible and effective method in accurately locating the propagation source. Meanwhile, on the BA models, WS models and real networks, the OSDIS obviously outperforms the OSBFS, which indicates that the relaxed DIS heuristic outperforms the BFS heuristic in locating the propagation source.

The average error hops of the six algorithms on different networks are shown in Table 6. The average error delay is shown in Table 7. The average running time ratios between other five algorithms and the OSDIS on all networks are shown in Table 8. From Table 8, we can see that the efficiency of the OSDIS is inferior to the TRBS and SNF, similar with the OSBFS, and superior to the Gauss and GSSI.

## 6 Conclusion

In this paper, we locate the propagation source by utilizing both of the diffusion direction information and the infection time information of the observers. We introduce a relaxed direction-induced search (DIS) to utilize the diffusion direction information of the observers to approximate the actual diffusion tree on a network. Based on the relaxed DIS, we further utilize the infection time information of the observers to define two kinds of observer-based similarity measures, including the Infection Time Similarity and the Infection Time Order Similarity. With the two kinds of similarity measures and the relaxed DIS, a source locating method termed as OSDIS is proposed. The feasibility and effectiveness of the OSDIS are validated on a series of synthetic and real networks. Meanwhile, the experimental results also show that the relaxed DIS heuristic outperforms the BFS heuristic in propagation source locating. The current OSDIS is only developed for single source locating. In the future work, we will study the OSDIS-based multi-sources locating method.

## Data availability

Enquiries about data availability should be directed to the authors.

## References

Altarelli F, Braunstein A, Dall’Asta L, Lage-Castellanos A, Zecchina R (2014) Bayesian inference of epidemics on networks via belief propagation. Phys Rev Lett 112(11):118701

Antulov-Fantulin N, Lančić A, Šmuc T, Štefančić H, Šikić M (2015) Identification of patient zero in static and temporal networks: robustness and limitations. Phys Rev Lett 114(24):248701

Barabasi AL, Albert R (1999) Emergence of scaling in random networks. Science 286(5439):509–512

Brockmann D, Helbing D (2013) The hidden geometry of complex, network-driven contagion phenomena. Science 342(6164):1337–1342

Cai K, Hong X, Lui JCS (2018) Information spreading forensics via sequential dependent snapshots. IEEE/ACM Trans Netw 26(1):478–491

Caputo JG, Hamdi A, Knippel A (2019) Inverse source problem in a forced network. Inverse Probl 35(5):055006

Chai Y, Wang Y, Zhu L (2021) Information sources estimation in time-varying networks. IEEE Trans Inf Forens Secur 99:2621–2636

Chang B, Chen E, Zhu F, Liu Q, Xu T, Wang Z (2020) Maximum a posteriori estimation for information source detection. IEEE Trans Syst Man Cybern Syst 50(6):2242–2256

Feizi S, Médard M, Quon G, Kellis M, Duffy K (2019) Network infusion to infer information sources in networks. IEEE Trans Netw Sci Eng 6(3):402–417

Fu L, Shen Z, Wang WX, Fan Y, Di Z (2016) Multi-source localization on complex networks with limited observers. EPL 113(1):18006

Horn AL, Friedrich H (2019) Locating the source of large-scale outbreaks of foodborne disease. J R Soc Interface 16(151):20180624

Hosseini S, Azgomi MA (2016) A model for malware propagation in scale-free networks based on rumor spreading process. Comput Netw 108:97–107

Hu Z, Wang L, Tang C (2019) Locating the source node of diffusion process in cyber-physical networks via minimum observers. Chaos Interdiscip J Nonlinear Sci 29(6):063117

Jiang J, Sheng W, Shui Y, Yang X, Zhou W (2017) Identifying propagation sources in networks: State-of-the-art and comparative studies. IEEE Commun Surv Tutor 19(1):465–481

Jiang J, Wen S, Yu S, Xiang Y, Zhou W (2018) Rumor source identification in social networks with time-varying topology. IEEE Trans Dependable Secure Comput 15(1):166–179

Jiang J, Wen S, Yu S, Xiang Y, Zhou W (2015) K-center: An approach on the multi-source identification of information diffusion. IEEE Trans Inf Forensics Secur 10(12):2616–2626

Kendall MG (1938) A new measure of rank correlation. Biometrika 30(1/2):81–93

Kunegis J (2013) Konect: The koblenz network collection. In: Proceedings of the 22nd International Conference on World Wide Web, New York, NY, USA, pp 1343–1350

Li J, J M, Bertuzzo E, Kolaczyk E D, (2021) Sensor-based localization of epidemic sources on human mobility networks. PLoS Comput Biol 17(1):e1008545

Lokhov AY, Mézard M, Ohta H, Zdeborová L (2014) Inferring the origin of an epidemic with a dynamic message-passing algorithm. Phys Rev E 90(1):012801

Luo W, Tay WP, Leng M (2013) Identifying infection sources and regions in large networks. IEEE Trans Signal Process 61(11):2850–2865. https://doi.org/10.1109/TSP.2013.2256902

Luo W, Tay WP, Leng M (2014) How to identify an infection source with limited observations. IEEE J Sel Top Signal Process 8(4):586–597

Manitz J, Harbering J, Schmidt M, Kneib T, Schöbel A (2017) Source estimation for propagation processes on complex networks with an application to delays in public transportation systems. J R Stat Soc Series C Appl Stat 66(3):521–536

Newman MEJ (2002) Assortative mixing in networks. Phys Rev Lett 89(20):208701

Paluch R, Lu X, Suchecki K, Szymański BK, Holyst JA (2018) Fast and accurate detection of spread source in large complex networks. Sci Rep 8(1):2508

Paluch R, Gajewski Łukasz G, Hołyst JA, Szymanski BK (2020) Optimizing sensors placement in complex networks for localization of hidden signal source: A review. Future Gen Comput Syst 112:1070–1092

Pinto PC, Patrick T, Martin V (2012) Locating the source of diffusion in large-scale networks. Phys Rev Lett 109(6):068702

Prakash BA, Vreeken J, Faloutsos C (2014) Efficiently spotting the starting points of an epidemic in a large graph. Knowl Inf Syst 38(1):35–59

Rossi RA, Ahmed NK (2015) The network data repository with interactive graph analytics and visualization. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, Texas, pp 4292–4293, https://ojs.aaai.org/index.php/AAAI/article/view/9277

Rui X, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678

Shah D, Zaman T (2011) Rumors in a network: Who’s the culprit? IEEE Trans Inf Theory 57(8):5163–5181

Shen Z, Cao S, Wang WX, Di Z, Stanley HE (2016) Locating the source of diffusion in complex networks by time-reversal backward spreading. Phys Rev E 93(3):032301

Tang W, Ji F, Tay WP (2018) Estimating infection sources in networks using partial timestamps. IEEE Trans Inf Forensics Secur 13(12):3035–3049

Tian H, Liu Y, Li Y, Wu CH, Chen B, Kraemer MUG, Li B, Cai J, Xu B, Yang Q, Wang B, Yang P, Cui Y, Song Y, Zheng P, Wang Q, Bjornstad ON, Yang R, Grenfell BT, Pybus OG, Dye C (2020) An investigation of transmission control measures during the first 50 days of the covid-19 epidemic in china. Science 368(6491):638–642

Wang H (2019) An universal algorithm for source location in complex networks. Phys A Stat Mech Appl 514:620–630

Wang H, Sun K (2020) Locating source of heterogeneous propagation model by universal algorithm. EPL Europhys Lett 131(4):48001

Wang HJ, Zhang FF, Sun KJ (2021) An algorithm for locating propagation source in complex networks. Phys Lett A 393:127184

Wang Y, Wen S, Xiang Y, Zhou W (2014) Modeling the propagation of worms in networks: A survey. IEEE Commun Surv Tutor 16(2):942–960

Wang Z, Dong W, Zhang W, Tan CW (2014) Rumor source detection with multiple observations: Fundamental limits and algorithms. SIGMETRICS Perform Eval Rev 42(1):1–13

Wang Z, Dong W, Zhang W, Tan CW (2015) Rooting our rumor sources in online social networks: The value of diversity from multiple observations. IEEE J Sel Top Signal Process 9(4):663–677

Wang Z, Hou D, Gao C, Huang J, Xuan Q (2022) A rapid source localization method in the early stage of large-scale network propagation. In: Proceedings of the ACM Web Conference 2022, New York, NY, USA, pp 1372–1380

Watts DJ, Strogatz SH (1998) Collective dynamics of ‘small-world’ networks. Nature 393(6684):440

Xu S, Teng C, Zhou Y, Peng J, Zhang ZK (2019) Identifying the diffusion source in complex networks with limited observers. Phys A Stat Mech Appl 527:121267. https://doi.org/10.1016/j.physa.2019.121267

Yang F, Zhang R, Yao Y, Yuan Y (2016) Locating the propagation source on complex networks with propagation centrality algorithm. Knowl Based Syst 100:112–123

Yang F, Yang S, Peng Y, Yao Y, Wang Z, Li H, Liu J, Zhang R, Li C (2020) Locating the propagation source in complex networks with a direction-induced search based gaussian estimator. Knowl Based Syst 195:105674

Yu F, Xia X, Li W, Tao J, Ma L, Cai Z (2017) Critical node identification for complex network based on a novel minimum connected dominating set. Soft Comput 21(19):5621–5629

Zhang X, Wu J, Zhao P, Su X, Choi D (2018) Epidemic spreading on a complex network with partial immunization. Soft Comput 22(14):4525–4533

Zhu K, Ying L (2014) A robust information source estimator with sparse observations. Comput Social Netw 1(1):1–21

Zhu K, Ying L (2016) Information source detection in the sir model: A sample-path-based approach. IEEE/ACM Trans Netw 24(1):408–421

## Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 62062010), the Science and Technology Planning Project of Guangxi (Grant No. AD19245101), the Science and Technology Planning Project of Liuzhou City (Grant No. 2020PAAA0606), the National Natural Science Foundation of China (Grant No. 62061003, 61963006) and the Doctoral Foundation of Guangxi University of Science and Technology (Grant No. 19Z06).

## Author information

### Authors and Affiliations

### Corresponding author

## Ethics declarations

### Conflict of interest

The authors declare that they have no conflict of interest.

### Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

## Additional information

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Appendices

### A. Correlation coefficient

The correlation coefficient \(\tau \) defined in reference (Kendall 1938) considers a set of joint ranks from two sequence \({{\textbf {X}}}\) and \({{\textbf {Y}}}\). For arbitrary \(x_i, x_j\in {{\textbf {X}}}\), \(y_i,y_j\in {{\textbf {Y}}}\), any pair of two-tuples \(\left( x_i,y_i\right) \) and \(\left( x_j,y_j\right) \) \(\left( i\ne j\right) \) are said to be concordant if both \(x_i>x_j\) and \(y_i>y_j\) or if both \(x_i<x_j\) and \(y_i<y_j\). They are said to be discordant if both \(x_i>x_j\) and \(y_i<y_j\) or if both \(x_i<x_j\) and \(y_i>y_j\). If \(x_i=x_j\) or \(y_i=y_j\), the pair is neither concordant nor discordant. \(\tau \) is defined as:

where \(n_c\) and \(n_d\) denote the number of concordant and discordant pairs, respectively. \(\tau \in \left[ -1,1\right] \).

### B. Properties

### Property 1

Suppose that the propagation ratio of the SI model is \(\beta =1\), for \(\mathcal {G}=\left( \mathcal {V},\mathcal {E},{\varvec{\theta }}\right) \) and \(\mathcal {O}=\left\{ o_k\right\} ^{\mathcal {K}}_{k=1}\), suppose \(\mathcal {O}=\mathcal {V}\), then

### Proof

By Eq. 6, the proof of Eq. 10 can be converted into the proof about \(\mathcal {D}\left( {{\textbf {T}}}_{\mathcal {O}},{{\textbf {T}}}_{\mathcal {T}_{{\text {DIS}},s}}\right) =0\). Firstly, since \(\beta =1\), on the original \(\mathcal {G}\), each node \(o_k\) is first time infected along the weighted shortest path between \(s^*\) and \(o_k\). Meanwhile, the Infection Timing information can be recorded in \(o_k\), denoted by \(t_{o_k}\) (Definition 1). Secondly, since \(\mathcal {O}=\mathcal {V}\), by combining with Algorithm 1, we know that the \(\mathcal {T}_{{\text {DIS}},s}\) is the actual diffusion tree corresponding to the first time each node in \(\mathcal {G}\) gets infected. Thus, for the Measuring Infection Time of each \(v_{o_k}\) (denoted by \(t_{v_{o_k}}\), Definition 2), there is \(t_{v_{o_k}}=t_{o_k}\). According to the \(\mathcal {D}\left( {{\textbf {T}}}_{\mathcal {O}},{{\textbf {T}}}_{\mathcal {T}_{{\text {DIS}},s}}\right) \) in Definition 3, there is \(\mathcal {D}\left( {{\textbf {T}}}_{\mathcal {O}},{{\textbf {T}}}_{\mathcal {T}_{{\text {DIS}},s}}\right) =0\). Property 1 is proved. \(\square \)

### Property 2

Suppose that the propagation ratio of the SI model is \(\beta =1\), for \(\mathcal {G}=\left( \mathcal {V},\mathcal {E},{\varvec{\theta }}\right) \) and \(\mathcal {O}=\left\{ o_k\right\} ^{\mathcal {K}}_{k=1}\), suppose \(\mathcal {O}=\mathcal {V}\), then

### Proof

By Eq. 7, the proof of Eq. 11 can be converted into the proof about \(\tau \left( {{\textbf {TO}}}_{\mathcal {O}},{{\textbf {TO}}}_{\mathcal {T}_{{\text {DIS}},s}}\right) =1\). From Property 1, we know that, since \(\mathcal {O}=\mathcal {V}\), then we get \(\varvec{\mathcal {S}}\left( {{\textbf {T}}}_{\mathcal {O}},{{\textbf {T}}}_{\mathcal {T}_{{\text {DIS}},s}}\right) =1\). For \({{\textbf {T}}}_{\mathcal {O}}=\left\{ t_{o_k}\right\} ^{\mathcal {K}}_{k=1}\) and \({{\textbf {T}}}_{\mathcal {T}_{{\text {DIS}},s}}=\left\{ t_{v_{o_k}}\right\} ^\mathcal {K}_{k=1}\), there is \(t_{v_{o_k}}=t_{o_k}\). Thus, the Observation Infection Time Order in \({{\textbf {TO}}}_{\mathcal {O}}\) is consistent with the Measuring Infection Time Order in \({{\textbf {TO}}}_{\mathcal {T}_{{\text {DIS}},s}}\). According to the Definition of correlation coefficient \(\tau \) (Appendix A), \(\tau \left( {{\textbf {TO}}}_{\mathcal {O}},{{\textbf {TO}}}_{\mathcal {T}_{{\text {DIS}},s}}\right) =1\). Property 2 is proved. \(\square \)

### C. Supplemental material

The interquartile range (IQR) of error hop of Gauss, GSSI, TRBS, SNF, OSBFS and OSDIS algorithms on the real networks is shown in Figs. 22, 23, 24, 25, 26, 27, 28.

## Rights and permissions

**Open Access** This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

## About this article

### Cite this article

Yang, F., Li, C., Peng, Y. *et al.* Locating the propagation source in complex networks with observers-based similarity measures and direction-induced search.
*Soft Comput* **27**, 16059–16085 (2023). https://doi.org/10.1007/s00500-023-08000-7

Accepted:

Published:

Issue Date:

DOI: https://doi.org/10.1007/s00500-023-08000-7

### Keywords

- Complex networks
- Propagation source locating
- Direction-induced search (DIS)
- Observers-based similarity measures
- Diffusion direction information
- Infection time information