Next we propose a novel network structure, the community network, which expands the definition of passenger connectivity to be a function of both the number of transfers passengers make together and the total amount of time they spend together while traveling. This contrasts with the contact network, which quantifies passenger connectivity using only contact duration on individual vehicle trips. We define a community of passengers as a set of passengers who have common travel patterns, e.g., vehicle trips and/or transfers. In order to build communities, we define and quantify a novel connection strength metric between passengers that indicates the similarity of their travel patterns, and we create a new network structure, the community network, based on this value. Section 4.1 outlines this process in more detail.
This network can serve as the basis of a community detection algorithm, but in this paper we take a different approach. We define communities as those connected by edges whose values lie above a predetermined threshold. In Section 4.2 we demonstrate this feature by identifying the commuting patterns of the members of the largest passenger community of the network.
We propose an application of the community network in Section 4.3. We show how the connection strength value can be used to model infectious disease transmission among passengers of a public transit system. Building on our previous work in Bóta et al. (2017a), we seek to identify the vehicle trips most likely to carry infected passengers during an outbreak.
Community Network Construction
In order to detect the communities of the public transportation network, we construct a weighted network structure called the community network. The community network connects passengers using a novel link-based metric, the connection strength. The connection strength s defines the edge weights in the community network and takes into account the number of transfers a pair of passengers makes together. Thus, if the passengers meet on multiple different vehicle trips, their connection strength s will increase. This method is based on the assumption that passengers who not only travel together but also transfer together have a stronger connection than travelers who are simply present on the same vehicle trip at the same time. Below we explain how this link metric is derived.
Let H denote the new network where the nodes are the passengers, and the set of nodes V(H) includes all passengers who traveled on at least two vehicle trips with another passenger. Thus, the nodes in the community network correspond to the edges of the transfer network. We define the connections and weights between passengers as follows. Nodes u and v in the community network are connected if they are both present in at least two different cliques \(c_{1}, c_{2}\) in two different subgraphs corresponding to vehicle trips, i.e., they traveled together on at least two vehicle trips (\(u,v \in c_{1} \subseteq G_{t_{1}}\) and \(u,v \in c_{2} \subseteq G_{t_{2}}\)). Let \(g_{uv}\) denote the number of instances where u and v are members of the same clique, that is, \(g_{uv} = |C_{uv}|\) where \(c_{uv} \in C_{uv}\) if \(u,v \in c_{uv}\). Let \(T_{uv}\) be the set of vehicle trips where both u and v are present: \(t_{uv} \in T_{uv}\) if \(u,v \in t_{uv}\), and let \(g_{uv}^{t_{i}}\) be the number of cliques in vehicle trip \(t_{i}\) where u and v are both present: \(g_{uv}^{t_{i}} = |C_{uv}^{t_{i}}|\) where \(c_{uv}^{t_{i}} \in C_{uv}^{t_{i}}\) if \(u,v \in c_{uv}^{t_{i}} \subseteq G_{t_{i}}\). Thus, the connection strength \(s_{uv}\) between passengers u and v can be formalized as follows:
$$ s_{uv}=\frac{g_{uv}*(g_{uv}-1)}{2}-\sum\limits_{t_{i} \in T_{uv}}\frac{g_{uv}^{t_{i}}*(g_{uv}^{t_{i}} -1)}{2} $$
(1)
Using this definition of connection strength, if a pair of passengers traveled together in \(g_{uv}\) different atomic passenger groups, then they would have \(\frac {g_{uv}*(g_{uv}-1)}{2}\) different edges between them, because the affected nodes form a clique in the transfer network. In this way the first term of the equation rewards movements between different vehicle trips. Since, under our community definition, traveling on the same vehicle trip does not indicate a strong connection between passengers, the second term of the equation penalizes any instance where u and v travel together on the same vehicle trip. The value of the penalty is the sum of the edges in every vehicle trip where the passenger pair appears more than once, and the number of edges in a vehicle trip is counted in the same way as in the first term of the equation.
Algorithm 1 shows the construction of the community network, while Fig. 5 illustrates a few examples of computing s between passengers. In Fig. 5a two passengers travel together in two cliques on vehicle trip t1 and one clique on vehicle trip t2, therefore \( g_{uv}^{t_{1}} = 2, \ g_{uv}^{t_{2}} = 1, \ g_{uv}=3\) and s = 2. A different situation is shown in Fig. 5b, where two passengers travel together in three cliques on t3 and two cliques on t4, making \(g_{uv}^{t_{3}} = 3, \ g_{uv}^{t_{4}} = 2, \ g_{uv}=5\) and s = 6. Figure 5c presents a trivial case in which two passengers travel together on a single vehicle trip in four cliques, making \( g_{uv}^{t_{1}} = 4, \ g_{uv}=4\) and s = 0.
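The computation of s in Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration and not a reproduction of Algorithm 1: the function names `connection_strength` and `build_community_network`, and the input format (a list of `(trip_id, members)` pairs, one per clique), are assumptions made for this example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def pair_count(n):
    """Number of unordered pairs among n items, i.e. n*(n-1)/2."""
    return n * (n - 1) // 2


def connection_strength(clique_trips):
    """Connection strength s_uv of Eq. (1) for one passenger pair.

    clique_trips holds one vehicle trip id per clique shared by the pair
    (hypothetical input format), so len(clique_trips) equals g_uv.
    """
    g_uv = len(clique_trips)
    per_trip = Counter(clique_trips)  # g_uv^{t_i} for every trip t_i in T_uv
    return pair_count(g_uv) - sum(pair_count(g) for g in per_trip.values())


def build_community_network(cliques):
    """Return {(u, v): s_uv} for all pairs with s_uv > 0.

    cliques is an iterable of (trip_id, members) pairs (assumed format).
    s_uv is positive exactly when the pair shares cliques on at least two
    different vehicle trips, which is the connection condition in the text.
    """
    shared = defaultdict(list)
    for trip_id, members in cliques:
        for u, v in combinations(sorted(members), 2):
            shared[(u, v)].append(trip_id)
    edges = {}
    for pair, trips in shared.items():
        s = connection_strength(trips)
        if s > 0:
            edges[pair] = s
    return edges


# The three cases of Fig. 5, expressed as one trip id per shared clique.
fig_5a = connection_strength(["t1", "t1", "t2"])
fig_5b = connection_strength(["t3", "t3", "t3", "t4", "t4"])
fig_5c = connection_strength(["t1", "t1", "t1", "t1"])
```

With these inputs the three values reproduce s = 2, s = 6 and s = 0 from Fig. 5.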
Passenger communities
In this section we illustrate examples of passenger communities in the public transportation system of the Twin Cities, MN. The first example, shown in Fig. 6, presents subgraphs of the community network constructed from G30, i.e., the contact network containing only edges where contact duration is above 30 minutes. Figure 6a depicts the entire community network. Most of the communities in this network are of size two or three, but there are several larger communities with strong connections between the members. Figure 6b shows a subgraph where edges with weights s < 5 are omitted, as well as all nodes with degrees below two. The remaining subgraph contains the largest group of the network, while the largest individual community is depicted in Fig. 6c.
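The filtering used for Fig. 6b can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the degree filter is assumed to be applied once after the weak edges are removed, and reading the remaining groups as connected components is our assumption rather than a procedure stated in the text.

```python
from collections import defaultdict


def filter_community_network(edges, min_strength=5, min_degree=2):
    """Drop edges with s < min_strength, then drop nodes of low degree.

    edges maps (u, v) pairs to connection strength s. The degree is
    computed once, after the strength filter (an assumption).
    """
    kept = {pair: s for pair, s in edges.items() if s >= min_strength}
    degree = defaultdict(int)
    for u, v in kept:
        degree[u] += 1
        degree[v] += 1
    return {
        (u, v): s
        for (u, v), s in kept.items()
        if degree[u] >= min_degree and degree[v] >= min_degree
    }


def connected_components(edges):
    """Group the nodes of an undirected edge set into connected components."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


# Toy edge list with made-up strengths, for illustration only.
toy_edges = {("a", "b"): 6, ("b", "c"): 7, ("a", "c"): 5, ("c", "d"): 9, ("e", "f"): 2}
filtered = filter_community_network(toy_edges)
groups = connected_components(filtered)
```

In the toy example the edge (e, f) is removed by the strength threshold and node d by the degree threshold, leaving a single group of three passengers.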
Figure 6c shows the largest group in the network. The group contains nine passengers who traveled together on two different vehicle trips, while the overall time they spent using the public transportation network was almost 1.5 hours. The passengers embarked on route 805 in the morning between 7:08 and 7:16 near Blaine and traveled together to Northtown. They disembarked at 7:48, waited together for the second vehicle, 852, which arrived at 8:12, and traveled together to downtown Minneapolis for almost an hour, disembarking between 9:12 and 9:16. The travel path of the community, shown in Fig. 7, indicates a commuting pattern from one of the suburbs to the city center of Minneapolis.
Both contact and community networks reveal contact patterns between passengers of a public transportation system. In the contact network, the weight between individual passengers is defined based on the contact duration on individual vehicle trips. In contrast, the community network provides a more refined way to represent connection strength, one that takes into account the number of transfers passengers make together.
The underlying concept behind community structure in networks is homophily, a well-studied concept in the social sciences (Eagle et al. 2009; Yuan and Gay 2006; Chin et al. 2012). Homophily states that people tend to form groups according to shared interests, occupation, etc. Physical proximity, for example on public transportation, is one of these indicators. Therefore, strong connections in the community network can help us uncover connections in other areas of life, such as a shared workplace, school or other common interest. It should be noted that physical proximity does not guarantee another type of connection; it simply increases the likelihood that one exists.
Epidemic Spreading Risk Application
One application of identifying the communities within the transit network, as described in this work, is infrastructure security. Understanding passenger communities enables more efficient and accurate tracking of infectious disease spread, were a disease to be naturally or maliciously introduced into the public transit system. One of the challenges in modeling epidemic spreading is accurately mapping the relationships between individuals traveling on the same vehicle. A traditional contact network, as defined in Section 2 as well as in Bóta et al. (2017a, 2017b) and Sun et al. (2013), simply reveals the set of passengers who were present on the same vehicle and the amount of time they spent on it. As shown in Bóta et al. (2017b), this restricts infection spreading probabilities to be a function of contact duration alone, without the possibility of taking into account physical proximity, communication, etc. between people.
In contrast, the weights of the community network reveal a deeper level of connection between travelers, as passengers with strong connections in this network may also be connected in other areas of life. Passenger groups identified in the community network are more likely to be traveling within close physical proximity of each other and interacting with each other (Eagle et al. 2009; Yuan and Gay 2006; Chin et al. 2012). As a consequence, the probability of disease transmission between travelers belonging to the same community is greater than between two passengers simply sharing the same vehicle without any other connection. This is especially true for large public transportation vehicles like trains or trams, where simply being present on the same vehicle may not imply any kind of connection at all.
In this section we expand upon our previous work in Bóta et al. (2017a), which examined epidemic spreading risk in the same public transportation network (Twin Cities, MN) with the goal of identifying the vehicle trips most likely to carry infected passengers. The analysis was presented in two parts. First, the passenger contact network, in the same form as in Section 2, was used to model a variety of outbreak scenarios. The scenarios differed in the number and distribution of initially infected passengers and in the level of infectiousness, represented as the risk of spreading between passengers. The spreading risk was defined as the contact duration multiplied by a constant value. The scenarios were compared in terms of their impact on the network and confirmed previous observations in the literature, namely that the most central vehicle trips in the public transportation network are also the most susceptible to infection. The second part of the analysis focused on a newly proposed vehicle trip network, which represents the public transit network as a network of vehicle trips instead of passengers. We showed that centrality metrics on the vehicle trip network provided a more efficient way to detect the set of vehicle trips most susceptible to disease, and that this estimation can be done much faster than running simulations on the contact network.
In the rest of this section we present an alternative way to model the risk of disease spreading between passengers, which exploits the community network structure introduced in this work. We define the infection transmission probability between a pair of passengers to be a function of both the connection strength value used in the community network and contact duration. We believe this multidimensional transmission probability more accurately represents the level of connectivity between pairs of passengers. Using these new transmission functions, we implement similar spreading scenarios to the ones in Bóta et al. (2017a) and identify and rank the vehicle trips susceptible to disease spreading. Below we define the new transmission probability function and the infection model, then present the new vehicle-trips identified to be at highest risk.
Experiment Setup
In order to simulate an epidemic outbreak on the contact network we use the well-known discrete compartmental susceptible-infected (SI) model. In the SI infection process all nodes adopt one of two available states: susceptible (S) or infected (I). Real values called edge infection probabilities, denoted \(w_{e} \in [0,1]\), are assigned to the links of the network. The infection spreads in a network when susceptible nodes adopt an infected state. This is done in discrete time steps in an iterative manner, starting from an initially infected set of nodes. In each iteration each infected node tries to infect its susceptible neighbors according to the transmission probability of the link connecting them. If the attempt is successful, the susceptible node is transformed into an infected one in the next iteration. In this work we limit the number of iterations to five, representing a complete work week with recurrent commuting patterns. The inputs of this model are: 1. a contact network of individuals, 2. an assignment of weights to the links of the network, which represent the infection transmission probabilities, and 3. the set of initially infected nodes, i.e., individuals.
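A single run of the discrete SI process described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the function name `simulate_si`, the adjacency-dictionary input and the `frozenset` keys for undirected links are all assumptions made for this example.

```python
import random


def simulate_si(neighbors, weights, initially_infected, steps=5, rng=None):
    """Run one realization of the discrete SI model and return the infected set.

    neighbors: dict mapping each node to an iterable of adjacent nodes
    weights:   dict mapping frozenset({u, v}) to the probability w_e in [0, 1]
    steps:     number of iterations (five in the text, one per working day)
    """
    rng = rng or random.Random()
    infected = set(initially_infected)
    for _ in range(steps):
        newly_infected = set()
        for u in infected:
            for v in neighbors.get(u, ()):
                if v in infected or v in newly_infected:
                    continue  # only susceptible nodes can be infected
                if rng.random() < weights[frozenset((u, v))]:
                    newly_infected.add(v)
        # Nodes infected in this iteration become infectious in the next one.
        infected |= newly_infected
    return infected


# Toy path a - b - c with certain transmission, for illustration only.
toy_neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
toy_weights = {frozenset(("a", "b")): 1.0, frozenset(("b", "c")): 1.0}
after_one_step = simulate_si(toy_neighbors, toy_weights, {"a"}, steps=1)
after_two_steps = simulate_si(toy_neighbors, toy_weights, {"a"}, steps=2)
```

Because newly infected nodes are added only at the end of each iteration, the infection advances by at most one link per time step, as in the description above.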
We use the original structure of the contact network, which connects all pairs of passengers that travel on a vehicle trip together, as the basis of this experiment. The link weights are computed in the following way. Since the nodes and links of the community network are a subset of the nodes and links of the contact network, if a link between two nodes is present in the community network we assign its connection strength value to the corresponding link in the contact network. We do this in all cases where such an assignment can be made. In order to account for extreme outliers of connection strength we cap all values at 100, then rescale all values to the interval [0.1, 0.8] using a standard feature scaling method. The 0.8 upper bound is set because a transmission probability of 1.0 is assumed to be unrealistic. To all links of the contact network that do not have a corresponding link in the community network we assign a uniform infection value of 0.05. In contrast to the duration-based value used in Bóta et al. (2017a), this enables us to capture the increased spreading risk between passengers sharing similar travel patterns, while still allowing disease to spread between travelers simply sharing a vehicle. Future work will explore the model’s sensitivity to the uniform infection assignment used in this work.
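The weight assignment can be illustrated with the sketch below. The text specifies only "a standard feature scaling method", so the min-max scaling used here is an assumption, as are the function name `transmission_probabilities`, its parameter names, and the choice to map all values to the lower bound when every capped strength is identical.

```python
def transmission_probabilities(contact_links, strengths, cap=100.0,
                               low=0.1, high=0.8, default=0.05):
    """Assign a transmission probability to every contact-network link.

    contact_links: iterable of links, e.g. (u, v) tuples
    strengths:     dict mapping community-network links to connection strength s
    Links without a connection strength receive the uniform default value.
    """
    links = list(contact_links)
    link_set = set(links)
    # Cap extreme outliers before rescaling.
    capped = {l: min(s, cap) for l, s in strengths.items() if l in link_set}
    if capped:
        s_min, s_max = min(capped.values()), max(capped.values())
    else:
        s_min = s_max = 0.0
    span = s_max - s_min
    probabilities = {}
    for link in links:
        if link in capped:
            # Min-max scaling to [low, high] (assumed scaling method).
            scaled = (capped[link] - s_min) / span if span else 0.0
            probabilities[link] = low + scaled * (high - low)
        else:
            probabilities[link] = default
    return probabilities


# Toy example with made-up strengths; ("c", "d") has no community link.
toy_links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
toy_strengths = {("a", "b"): 1, ("a", "c"): 250, ("b", "c"): 50.5}
toy_probabilities = transmission_probabilities(toy_links, toy_strengths)
```

In the toy example the strength of 250 is capped at 100 and mapped to 0.8, the smallest strength is mapped to 0.1, and the link without a community counterpart receives 0.05.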
Following the procedure in Bóta et al. (2017a), we randomly select 100 passengers from the network to be initially infected. Due to the probabilistic nature of the simulation model, we run the SI infection model k = 10000 times to quantify the likelihood of each node being in an infectious state at the end of the fifth time step.
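The repeated-simulation estimate can be sketched as a simple frequency count over k runs. The helper below is hypothetical: `run_once` stands for any function that performs one SI realization with the supplied random generator and returns the set of infected nodes, and how the 100 initially infected passengers are drawn is left to that function.

```python
import random
from collections import Counter


def estimate_infection_likelihood(run_once, k=10000, seed=0):
    """Estimate, for each node, the fraction of k runs in which it ends up infected.

    run_once(rng) must return the set of nodes infected at the end of one run.
    Nodes that are never infected are absent from the result (likelihood 0).
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(k):
        counts.update(run_once(rng))
    return {node: count / k for node, count in counts.items()}


# Deterministic stand-in for a simulation run, used only to show the interface.
always_a = estimate_infection_likelihood(lambda rng: {"a"}, k=100)
```

Passing the generator into `run_once` keeps all randomness under a single seed, which makes a set of runs reproducible.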
Vehicle Trip Ranking
The infection model constructed above provides the likelihood of infection for each passenger in the contact network at the end of the simulation, i.e., after five days. As in Bóta et al. (2017a), we compute a similar infection value for vehicle trips by summing the probability of infection for all passengers on a given vehicle trip. While this does not represent a probability value in a strict sense, this value is proportional to the risk of getting infected on a given vehicle trip.
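The aggregation from passengers to vehicle trips can be sketched as follows. The function name `rank_vehicle_trips` and the input dictionaries are assumptions for illustration, and the numbers in the example are made up.

```python
def rank_vehicle_trips(trip_passengers, infection_likelihood):
    """Rank vehicle trips by the summed infection likelihood of their passengers.

    trip_passengers:      dict mapping a trip id to the passengers on that trip
    infection_likelihood: dict mapping a passenger to the estimated likelihood
    Returns (trip_id, score) pairs sorted from highest to lowest score.
    """
    scores = {
        trip: sum(infection_likelihood.get(p, 0.0) for p in passengers)
        for trip, passengers in trip_passengers.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up toy data: three trips and three passengers.
toy_trips = {"t1": ["a", "b"], "t2": ["c"], "t3": ["a", "b", "c"]}
toy_likelihood = {"a": 0.5, "b": 0.25, "c": 0.125}
ranking = rank_vehicle_trips(toy_trips, toy_likelihood)
```

As noted above, the resulting score is not a probability: it can exceed 1 for trips carrying many passengers and is meaningful only for comparing trips.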
The routes that contain the highest-risk vehicle trips identified in this study, together with the corresponding travel times, are presented in Table 4. Figure 8 shows the routes on the map of the city. Consistent with the findings in Section 3 and in Bóta et al. (2017a), the vehicle trips fall in the morning and afternoon peak hours. Also consistent with the results in Bóta et al. (2017a), all of the high-risk routes identified go through the city center.
Table 4 Ranking of vehicle trips where infection is most likely to appear

Since the goal of Section 3.2 (identifying the most frequent trip combinations) and the goal of this section (identifying the vehicle trips at highest risk in the case of an epidemic outbreak) are different, the difference between the selected vehicle trips is not surprising. We nevertheless observe some similarities between the “highest risk” vehicle trips identified here and the most frequent vehicle trip combinations identified in Section 3.2. For example, route 18 connects Bloomington with the city center, while route 61 connects St. Paul to the same destination. We have seen in Section 3.2 that these target-destination pairs are present in the most frequent combinations as well. As these routes connect the most important parts of the city, some similarity is expected.
The set of high-risk vehicle trips identified here partially corresponds to the set identified in Bóta et al. (2017a). Specifically, the methodology proposed in Bóta et al. (2017a) identifies vehicle trips on routes 7, 9, 11, 14, 17, 25, 61 and 94 as most at risk. These are routes crossing or connecting to the city center in the peak hours, so even though they do not completely match those in Table 4, the pattern they present is the same. The similarity in the set and type of routes identified in both studies points to the critical role of the network structure in modeling outbreak risk. Future research will continue to build on this application and further explore the robustness and sensitivity of the proposed methodology. The relevant sensitivity analysis is, however, outside the scope of this work.