1 Introduction

Efficient urban public transportation plays a crucial role in facilitating human mobility and sustaining economic development. However, its shared, confined, and high-density travel spaces create favourable conditions for the transmission of diseases among passengers [1]. Infected passengers may transmit the disease to non-infected passengers during travel. Because passenger behaviours vary, some individuals cause more extensive infections than others when they are infected; these individuals are known as public transportation super spreaders [2]. Identifying these super spreaders is therefore essential. It can assist relevant authorities in implementing targeted measures, such as travel restrictions, vaccination programs, health monitoring, and other disease control strategies.

Current methods for identifying super spreaders [3,4,5] are not fully developed and do not take into account the strong community structure of the passenger contact network [6, 7]. In complex networks, community structure facilitates spread within communities while limiting spread between them, and thus affects the speed and extent of transmission. In networks with a strong community structure, node identification methods that consider community structure therefore prove more effective than traditional identification methods. Because existing approaches overlook the community structure of the passenger contact network, they lack insight into relationships within communities and represent the epidemiological interactions among passengers inaccurately. Consequently, they fail to capture the transmission ability of individual passengers precisely, which affects the final identification of public transportation super spreaders and may bias decisions on epidemiological control strategies. Considering the community structure of the passenger contact network may therefore yield a more accurate identification of the passenger groups with higher transmission capability.

2 Research Method

2.1 GHB Method

Passenger Contact Network Construction

The passenger contact network is defined as a weighted undirected graph G = {V, E, W, C}, where V = {vi | i = 1, 2, ⋅⋅⋅, n} is the set of nodes, with each node vi representing a passenger using public transportation. E = {eij | i, j = 1, 2, ⋅⋅⋅, n, i ≠ j} is the set of edges, where eij represents the connection between two passengers, vi and vj, who are simultaneously present in the same vehicle during their travel. W = {wij | i, j = 1, 2, ⋅⋅⋅, n, i ≠ j} is the set of edge weights, where the weight wij of edge eij is the length of time that passengers vi and vj spend together in the same vehicle. C = {Ck | k = 1, 2, ⋅⋅⋅, m} is the set of communities obtained through a community detection algorithm, with each community Ck containing Nk nodes.
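As an illustration of this definition, the following minimal sketch builds the edge weights wij from ride records. The record layout (passenger, vehicle run, boarding time, alighting time, with times in hours), the function name `build_contact_network`, and the sample data are assumptions made for this example and are not taken from the study's data pipeline.

```python
from collections import defaultdict
from itertools import combinations


def build_contact_network(rides):
    """Return {(u, v): shared hours} for an undirected contact network.

    rides: iterable of (passenger, vehicle_run, board, alight) tuples,
    with times in hours (hypothetical record layout).
    """
    by_run = defaultdict(list)
    for passenger, run, board, alight in rides:
        by_run[run].append((passenger, board, alight))

    weights = defaultdict(float)
    for riders in by_run.values():
        for (p1, b1, a1), (p2, b2, a2) in combinations(riders, 2):
            if p1 == p2:
                continue
            # Two passengers are in contact while both are on board.
            overlap = min(a1, a2) - max(b1, b2)
            if overlap > 0:
                edge = (p1, p2) if p1 < p2 else (p2, p1)
                weights[edge] += overlap  # accumulate over repeated contacts
    return dict(weights)


# Hypothetical sample: three passengers, two vehicle runs.
rides = [
    ("a", "bus7_run1", 8.0, 8.5),
    ("b", "bus7_run1", 8.25, 9.0),
    ("c", "bus7_run1", 8.5, 9.0),   # boards exactly when "a" alights
    ("a", "bus7_run2", 17.0, 17.5),
    ("b", "bus7_run2", 17.0, 17.25),
]
network = build_contact_network(rides)
```

In this sample, passengers a and b overlap for 0.25 h on each run, so their edge weight is 0.5 h, while a and c never share the vehicle and receive no edge.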

Community Division

To identify public transport super spreaders while considering the community structure of the passenger contact network, it is necessary to employ a community detection algorithm. This algorithm partitions the network into several communities based on the network's topology, aiming for strong connections within communities and weak connections between them. The Infomap algorithm is efficient, stable, and applicable to community detection in large networks. Therefore, this study utilizes the Infomap algorithm to perform community detection on the weighted passenger contact network.

The Infomap algorithm, based on the minimization of code length, employs a random walk approach to identify communities with the shortest path encoding length [8]. The description length L(M) for the random walk paths generated by partitioning the network's n nodes into m communities using partition method M is represented by Eq. (1). Initially, the Infomap algorithm treats each node as an individual community and progressively merges adjacent communities to maximize the reduction of the objective function L(M) until the reduction becomes negligible.

$$ L(M) = q_{\rm{ \curvearrowright }} H(Q) + \sum_{i = 1}^m {p^i_{\rm{ \circlearrowright }} H(P^i )} $$
(1)
$$ H(X) = - \sum_{i = 1}^n {p_i \log p_i } $$
(2)

where \(q_{\rm{ \curvearrowright }}\) is the probability that the random walk switches to another community at any given step; H(Q) is the information entropy of movements between communities; \(H(P^i)\) is the information entropy of movements within community i; and \(p^i_{\rm{ \circlearrowright }}\) is the sum of the probabilities of visiting each node in community i and the probability of exiting community i.
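To make Eq. (1) and Eq. (2) concrete, the sketch below evaluates L(M) for a given partition of a small weighted undirected graph. It only computes the objective and does not perform Infomap's search over partitions. It assumes the standard undirected case, in which the visit probability of a node is its strength divided by twice the total edge weight, and it uses base-2 logarithms; the function names and the toy graph are hypothetical.

```python
from collections import defaultdict
from math import log2


def plogp(p):
    """p * log2(p), with the usual convention 0 * log 0 = 0."""
    return p * log2(p) if p > 0 else 0.0


def map_equation(edges, membership):
    """Description length L(M) of Eq. (1) for an undirected weighted graph.

    edges: {(u, v): weight}; membership: {node: community label}.
    """
    total = 2.0 * sum(edges.values())
    visit = defaultdict(float)      # stationary visit probability per node
    exit_rate = defaultdict(float)  # exit probability per community
    for (u, v), w in edges.items():
        visit[u] += w / total
        visit[v] += w / total
        if membership[u] != membership[v]:
            exit_rate[membership[u]] += w / total
            exit_rate[membership[v]] += w / total

    q = sum(exit_rate.values())
    # First term of Eq. (1): q * H(Q).
    index_term = -q * sum(plogp(qi / q) for qi in exit_rate.values()) if q > 0 else 0.0

    members = defaultdict(list)
    for node, community in membership.items():
        members[community].append(node)

    # Second term of Eq. (1): sum over communities of p_circ * H(P^i).
    module_term = 0.0
    for community, nodes in members.items():
        p_circ = exit_rate[community] + sum(visit[n] for n in nodes)
        if p_circ == 0:
            continue
        entropy = -plogp(exit_rate[community] / p_circ)
        entropy -= sum(plogp(visit[n] / p_circ) for n in nodes)
        module_term += p_circ * entropy
    return index_term + module_term


# Toy graph: two triangles joined by a single edge, unit weights.
edges = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
         (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0, (2, 3): 1.0}
one_community = map_equation(edges, {n: 0 for n in range(6)})
two_communities = map_equation(edges, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1})
```

For this toy graph, splitting the two triangles into separate communities gives a shorter description length than keeping all nodes together, which is the kind of reduction the merging procedure looks for.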

Weighted Interconnection Density Calculation

As the community structure of the network plays a role in promoting transmission within communities while inhibiting transmission between communities, the level of connectivity between communities has significant implications. When a community has fewer connections to other communities, the nodes within that community primarily transmit the disease to their neighboring nodes within the same community. Their impact on other communities is relatively minimal, making the hub nodes within the community particularly crucial. On the other hand, when a community has numerous connections to other communities, the nodes within that community possess the ability to spread the disease to neighboring communities. In this scenario, identifying bridge nodes between communities becomes essential. Therefore, there is a need to quantify the interaction between communities and distinguish the contribution of nodes to transmission within their own community and transmission to other communities. For community Ck, in combination with all edges involving nodes within and outside Ck, along with their strengths, the weighted interconnection density of community Ck is defined as follows:

$$ \rho_{C_k } = \frac{{\sum\limits_{v_i \in C_k } {\frac{{S_{in} (i)}}{{S_{in} (i) + S_{out} (i)}}} }}{{N_k }} $$
(3)

where Sin (i) represents the internal community weight, which is the sum of edge weights between node vi and neighboring nodes within the community. Sout (i) represents the external community weight, which is the sum of edge weights between node vi and neighboring nodes outside the community.
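Eq. (3) can be sketched as follows. The function names and the four-node example are hypothetical, and treating a node with no edges as contributing zero to the sum is an assumption of this sketch, since Eq. (3) leaves that case undefined.

```python
from collections import defaultdict


def node_strengths(edges, membership):
    """Return (S_in, S_out): internal and external strength of every node."""
    s_in, s_out = defaultdict(float), defaultdict(float)
    for (u, v), w in edges.items():
        target = s_in if membership[u] == membership[v] else s_out
        target[u] += w
        target[v] += w
    return s_in, s_out


def interconnection_density(edges, membership):
    """Weighted interconnection density of each community, Eq. (3)."""
    s_in, s_out = node_strengths(edges, membership)
    ratio_sum = defaultdict(float)
    size = defaultdict(int)
    for node, community in membership.items():
        size[community] += 1
        total = s_in[node] + s_out[node]
        if total > 0:  # assumption: isolated nodes contribute 0
            ratio_sum[community] += s_in[node] / total
    return {c: ratio_sum[c] / size[c] for c in size}


# Hypothetical path a-b-c-d with communities {a, b} and {c, d}.
edges = {("a", "b"): 2.0, ("b", "c"): 1.0, ("c", "d"): 3.0}
membership = {"a": 0, "b": 0, "c": 1, "d": 1}
rho = interconnection_density(edges, membership)
```

Here node b spends 2 of its 3 units of strength inside community 0, so that community's density is (1 + 2/3)/2 = 5/6, while community 1 reaches (3/4 + 1)/2 = 0.875.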

Node GHB Value Calculation

The relationships between internal and external connections within communities, as well as the impact of community size, are reflected through weighted interconnection density and the number of community nodes. If node vi belongs to community Ck and its neighboring community is Cl, considering the reciprocal of edge weights as the distance between nodes, along with the internal and external weights of node vi and its neighboring node vj, as well as the weighted interconnection density of their respective communities and community size, we calculate the centrality of node vi using the calculation method of the gravity model. The GHB (Gravity Hub Bridge) value of node vi is calculated as follows:

$$ GHB(i) = \rho_{C_k } H(i) + (1 - \rho_{C_l } )B(i) $$
(4)
$$ H(i) = N_k \sum_{v_j \in I(i),v_j \in C_k } {\frac{{S_{in} (i)S_{in} (j)}}{{1/w_{ij}^2 }}} $$
(5)
$$ B(i) = \sum_{v_j \in I(i),v_j \notin C_k } {N_l } \frac{{S_{out} (i)S_{in} (j)}}{{1/w_{ij}^2 }} $$
(6)

where I(i) is the set of neighboring nodes of node vi.
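The following sketch shows one way to evaluate Eqs. (4) to (6). Dividing by 1/wij² is the same as multiplying by wij², which is how the code writes it. Because a node may have neighbors in several communities, the sketch applies the factor (1 − ρCl) and the size Nl to each external neighbor according to that neighbor's own community; this reading of Eq. (4) is an assumption of the sketch, as are the function name and the example graph.

```python
from collections import defaultdict


def ghb_scores(edges, membership):
    """GHB value of every node, following Eqs. (3) to (6)."""
    adj = defaultdict(dict)
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w

    # Internal and external strength of each node.
    s_in, s_out = defaultdict(float), defaultdict(float)
    for u, nbrs in adj.items():
        for v, w in nbrs.items():
            if membership[u] == membership[v]:
                s_in[u] += w
            else:
                s_out[u] += w

    # Community sizes N_k and weighted interconnection densities, Eq. (3).
    size, ratio_sum = defaultdict(int), defaultdict(float)
    for node, community in membership.items():
        size[community] += 1
        total = s_in[node] + s_out[node]
        if total > 0:
            ratio_sum[community] += s_in[node] / total
    rho = {c: ratio_sum[c] / size[c] for c in size}

    scores = {}
    for i, k in membership.items():
        hub, bridge = 0.0, 0.0
        for j, w in adj[i].items():
            l = membership[j]
            if l == k:
                # Term of Eq. (5); N_k is applied after the loop.
                hub += s_in[i] * s_in[j] * w ** 2
            else:
                # Term of Eq. (6), weighted per neighbor by (1 - rho of C_l).
                bridge += (1 - rho[l]) * size[l] * s_out[i] * s_in[j] * w ** 2
        scores[i] = rho[k] * size[k] * hub + bridge
    return scores


# Hypothetical path a-b-c-d with communities {a, b} and {c, d}.
edges = {("a", "b"): 2.0, ("b", "c"): 1.0, ("c", "d"): 3.0}
membership = {"a": 0, "b": 0, "c": 1, "d": 1}
scores = ghb_scores(edges, membership)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this example, nodes b and c gain a bridge contribution on top of their hub term, so each ranks above the other member of its community.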

2.2 Baseline Method

Existing research has used degree, strength, and k-shell decomposition to identify super spreaders in public transportation. Because the passenger contact network is weighted, we use the s-shell decomposition method [9], an extension of k-shell decomposition to weighted networks, in place of k-shell decomposition. We therefore choose strength (NS) and s-shell decomposition (s-shell) as baseline methods. In addition, we select weighted betweenness centrality (WBC) [10], weighted eigenvector centrality (WEC) [11], and the weighted gravity model (Gravity) [12] as baseline methods for comparison with GHB.

1) NS: The NS of node vi is the sum of the edge weights connected to it.

$$ NS(i) = \sum_{j \in I(i)} {w_{ij} } $$
(7)

2) WBC: The WBC of node vi is the fraction of shortest paths in the weighted network that pass through this node, summed over all pairs of other nodes. It reflects the role of the node as a hub of propagation in the network.

$$ WBC(i) = \sum_{i \ne s,i \ne j,s \ne j} {\frac{{d_{sj} (i)}}{{d_{sj} }}} $$
(8)

where dsj is the number of all shortest paths from node vs to node vj in the weighted network, and dsj (i) is the number of shortest paths passing through node vi in dsj.

3) WEC: The WEC evaluates the importance of a node from the importance of its neighboring nodes and is computed from the weighted adjacency matrix of the network.

$$ WEC(i) = \lambda^{ - 1} \sum_{j = 1}^n {w_{ij} e_j } $$
(9)

where λ is the largest eigenvalue of the weighted adjacency matrix W, and the eigenvector corresponding to λ is denoted as e = (e1,e2,…,en)T.

4) s-shell: Remove the nodes with the lowest NS from the network; all nodes removed in this step receive an s-shell value of 1, denoted as wks = 1. Then repeat the removal on the remaining sub-network, assigning an s-shell value of 2, denoted as wks = 2, to the nodes removed, and continue in this way until no nodes remain in the network.

5) Gravity: In [12], the degree of each node is regarded as its mass and the shortest path distance between two nodes is regarded as the distance between them, and a gravitational centrality index is proposed to identify influential spreaders in complex networks. For the weighted network, this paper replaces the degree in the formula with its weighted counterpart, NS.

$$ Gravity(i) = \sum_{v_j \in \Psi_i } {\frac{NS(i)*NS(j)}{{d_{ij}^2 }}} $$
(10)

where dij is the shortest path distance between node vi and node vj, and Ψi is the set of nodes whose distance to node vi is less than or equal to a given value, set to 3 in [12].
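As a small illustration of Eq. (7) and Eq. (10), the sketch below computes NS and the weighted gravity score. The text does not state how dij is measured in the weighted network, so this sketch assumes hop-count distance obtained by breadth-first search with a radius of 3; the function names and the example graph are hypothetical.

```python
from collections import defaultdict, deque


def strength(edges):
    """NS of every node, Eq. (7): the sum of its incident edge weights."""
    ns = defaultdict(float)
    for (u, v), w in edges.items():
        ns[u] += w
        ns[v] += w
    return dict(ns)


def weighted_gravity(edges, radius=3):
    """Gravity score of Eq. (10), with d_ij taken as hop count (assumption)."""
    ns = strength(edges)
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    scores = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        total = 0.0
        while queue:
            node = queue.popleft()
            if dist[node] == radius:
                continue  # do not expand beyond the neighborhood radius
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
                    total += ns[source] * ns[nbr] / dist[nbr] ** 2
        scores[source] = total
    return scores


# Hypothetical path a-b-c-d.
edges = {("a", "b"): 2.0, ("b", "c"): 1.0, ("c", "d"): 3.0}
ns = strength(edges)
gravity = weighted_gravity(edges)
```

On this path, node c has the highest NS (4.0) and also the highest gravity score, because both of its direct neighbors have high strength.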

2.3 Evaluation Method

Weighted SIR Model

We use the SIR model to measure the transmission capability of the identified super spreaders. The SIR model categorizes nodes into three health states: susceptible (S), infected (I), and recovered (R). Infected nodes transmit the disease to their susceptible neighbors with an infection probability λ, and infected nodes recover and gain immunity at a recovery rate β (in this paper, β = 1). Because the network is weighted, the infection probability is defined per edge as follows:

$$ \lambda_{ij} = m\lambda_t w_{ij} $$
(11)

where m controls the spread of the epidemic and wij is the edge weight between node vi and node vj. In this paper, m is set to 0.2, 0.5, and 0.8 to represent low, medium, and high levels of epidemiological transmission ability, respectively. According to reference [5], \(\lambda_t = 8.17 \times 10^{-4}\ \mathrm{h}^{-1}\).

For each identification method, we select the top p percent of nodes in its ranking as the initially infected nodes, run the SIR model on the network 100 times, and calculate Rm, the average final number of recovered nodes. This value is taken as the transmission capability of the super spreaders identified by that method.
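The evaluation procedure can be sketched as a discrete-time simulation. With β = 1, every infected node recovers after one step. The sketch assumes that edge weights are expressed in hours, so that m·λt·wij is dimensionless, and it caps the infection probability at 1; both choices, along with the function names, the rounding rule in `top_fraction`, and the sample data, are assumptions of this example.

```python
import random
from collections import defaultdict

LAMBDA_T = 8.17e-4  # per hour, the value quoted from reference [5]


def top_fraction(scores, p):
    """Nodes ranked in the top fraction p (0 < p <= 1), at least one node."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:max(1, int(p * len(ranked)))]


def simulate_sir(edges, seeds, m, rng):
    """One SIR run with beta = 1; returns the final number of recovered nodes."""
    adj = defaultdict(dict)
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w

    infected, recovered = set(seeds), set()
    while infected:
        newly_infected = set()
        for i in infected:
            for j, w in adj[i].items():
                if j in infected or j in recovered or j in newly_infected:
                    continue
                # Eq. (11), capped at 1 so it stays a probability (assumption).
                if rng.random() < min(1.0, m * LAMBDA_T * w):
                    newly_infected.add(j)
        recovered |= infected  # beta = 1: recover after a single step
        infected = newly_infected
    return len(recovered)


def mean_final_size(edges, seeds, m, runs=100, seed=0):
    """R_m: the final number of recovered nodes averaged over repeated runs."""
    rng = random.Random(seed)
    return sum(simulate_sir(edges, seeds, m, rng) for _ in range(runs)) / runs


# Hypothetical scores and contact durations (hours) on a path a-b-c-d.
edges = {("a", "b"): 2.0, ("b", "c"): 1.0, ("c", "d"): 3.0}
scores = {"a": 2.0, "b": 3.0, "c": 4.0, "d": 1.0}
seeds = top_fraction(scores, 0.5)
r_m = mean_final_size(edges, seeds, m=0.8)
```

Passing a fixed `seed` makes the 100 runs reproducible, which keeps comparisons between identification methods from depending on random number state.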

Transmission Range Difference

To facilitate the comparison of the transmission capability among different methods, we selected NS, a widely used and easily comprehensible metric, as a reference method. We calculated the transmission range difference, denoted as r, between the other five methods and NS.

$$ r = \frac{R_m - R_s }{{R_s }} $$
(12)

where Rs stands for the average final number of recovered nodes for the NS method.
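Eq. (12) amounts to a relative difference against the NS reference. A minimal sketch follows; the function name and the input values are hypothetical and are not results from this study.

```python
def transmission_range_difference(r_m, r_s):
    """Eq. (12): relative difference between a method's R_m and the NS value R_s."""
    if r_s <= 0:
        raise ValueError("R_s must be positive")
    return (r_m - r_s) / r_s


# Hypothetical averages of final recovered nodes, for illustration only.
r_example = transmission_range_difference(r_m=1270.0, r_s=1000.0)
r_reference = transmission_range_difference(r_m=1000.0, r_s=1000.0)
```

A positive value means the tested method's seeds infected more nodes on average than the NS seeds, and the reference method compared against itself gives 0.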

3 Algorithm Experiment

We used MySQL to extract passenger contact relationships and Python to build the passenger contact networks and perform community division.

3.1 Passenger Contact Network Construction

We collected one week of raw data for the bus routes of a city, including smart card data, GPS data, transit operation records, and bus stop coordinates. Passenger boarding stations were determined by linking these data sources. Alighting stations were determined with the trip chain method under the next-trip, last-trip, and return-trip assumptions. Transfer behaviours were identified using an independent threshold-based public transit transfer model. The methods for determining boarding and alighting points, as well as transfer judgments, are detailed in references [13, 14]. Individual passenger trips with the same travel purpose were combined into public transit Origin-Destination (OD) pairs, and data cleaning was performed to obtain the weekly public transit OD data. Based on the passengers' trip chain data, we designed an algorithm for determining contact among passengers in the same vehicle, as illustrated in Fig. 1.

Fig. 1. Flow chart of passenger contact judgment

3.2 Community Division Result

The passenger contact network is divided into communities, and the resulting modularity is shown in Table 1. It can be observed that the passenger contact network exhibits a strong community structure. However, a minority of passengers travel on less popular routes during off-peak hours, resulting in the presence of some independent communities.

Table 1. Community division results

3.3 Comparison of Methods

Fig. 2. Transmission range difference plot at different epidemiological levels: (a) low level, (b) medium level, (c) high level

Table 2. Transmission range difference table

Different super spreader identification methods yield different node rankings for the passenger contact network. It is therefore essential to compare the transmission ranges produced by the nodes that each method identifies. To facilitate comparison, the transmission range difference for each method is calculated using Eq. (12). Taking Monday as an example, the transmission range difference of the super spreaders selected by each method, at different epidemiological levels and in different proportions, is shown in Table 2. Because NS is the reference method, its transmission range difference is 0. The transmission range difference is also shown in Fig. 2, where the horizontal axis represents the super spreader ratio p and the vertical axis indicates the method's transmission range difference r. A positive value indicates that the public transportation super spreaders identified by the method have a higher transmission capability than those identified by NS.

Combining Table 2 and Fig. 2, it can be observed that the super spreaders identified by the GHB method show the largest transmission range difference. That is, the nodes selected by GHB infect a wider range, indicating that GHB identifies public transportation super spreaders more effectively than the other methods. For epidemics at low, medium, and high levels, the transmission range difference for GHB falls within 0.037–0.085, 0.058–0.178, and 0.067–0.254, respectively. The smallest value, 0.037, occurs in low-level epidemics at a super spreader ratio p of 0.5. The largest value, 0.254, occurs in high-level epidemics at a ratio p of 0.05. As the ratio p decreases and the epidemiological level increases, the transmission range difference for GHB increases, which implies that the identified super spreaders possess a stronger transmission capability relative to those identified by NS. Moreover, as p increases, the node sets selected by the various methods overlap substantially. At lower epidemiological levels, the transmission ability of super spreaders is more constrained; GHB still proves more effective in these cases, although the difference is not particularly pronounced.

The methods differ in what they measure, and their performance differs accordingly. The WEC method yields relatively poor transmission capability in its identification results. This is primarily due to the presence of nodes with exceptionally high degree in the network, which causes the centrality scores to concentrate around these high-degree nodes. As a consequence, the scores of the other nodes become hard to distinguish from one another. The s-shell method also demonstrates limited transmission capability in its identification results. This can be attributed to a drawback it shares with the k-shell method, namely, the inability to rank nodes within the same shell. Consequently, the s-shell method cannot quantify differences in transmission capability among nodes located within the same shell. The WBC method evaluates the significance of nodes as pivotal transmission points in a weighted network; a higher WBC value implies a greater likelihood that disease transmission passes through that node. The Gravity method takes into account the NS of a node and of its neighboring nodes, as well as the distance between them. However, it neglects the network's community structure and does not achieve the desired identification results. In contrast, the GHB method takes into account the community structure inherent in the passenger contact network and thus offers a more accurate representation of the epidemiological interactions among passengers. As a result, its identification results exhibit higher transmission ability at all three epidemiological levels.

4 Conclusion

The GHB method identifies public transport super spreaders with higher epidemiological transmission ability. As the identification proportion of super spreaders decreases and the epidemiological level increases, the GHB method becomes more effective.

However, this paper does not consider further infection caused by the virus spreading inside the bus after the infected passenger has exited the bus, or exposure caused by the infected passenger while waiting on the platform. More comprehensive pathways of epidemic spread will be considered in our future studies.