1 Introduction and Background

Networks can be used to represent public transportation systems from several perspectives. Traditionally, nodes and edges represent the physical infrastructure of a transportation system, e.g., routes and stops/stations (Háznagy et al. 2015), while recent developments in data collection and modeling allow researchers to accurately map contacts between individuals traveling together. Detailed commuting patterns can be recorded from smart card data (Sun et al. 2013, 2014; Bao et al. 2017) or can be the output of activity-based travel models (Brockmann et al. 2006; Song et al. 2010; de Montjoye et al. 2013; Chen et al. 2016; Gardner et al. 2012; Saberi et al. 2018). The resulting passenger contact patterns are both spatial and temporal in nature, and include travel times on specific vehicles and contact with other travelers (Ramadurai and Ukkusuri 2010; Illenberger et al. 2012; Khani et al. 2015). Such detailed travel patterns can then be used for various planning objectives, including estimating the capacity of infrastructure (Wang et al. 2011), calculating environmental impact (Carlsson-Kanyama and Lindén 1999), or designing surveillance and containment strategies during an epidemic outbreak (Pendyala et al. 2012; Rey et al. 2016). While these methods allow researchers to map contacts between known individuals, the data collection and processing required to recreate a real-world contact network present many challenges in terms of accuracy and computational complexity, among other issues such as privacy (Huerta and Tsimring 2002; Hoogendoorn and Bovy 2005; Balcan 2009; Salathé 2010; Funk et al. 2010; Nassir et al. 2012).

In this work we specifically focus on the community structure of public transit ridership patterns. Observations on social interactions reveal that people tend to form groups according to their lines of interest, occupation, etc. This concept is known as homophily in the social sciences (Eagle et al. 2009; Yuan and Gay 2006; Chin et al. 2012), while in network modeling we refer to this phenomenon as community structure (Fortunato 2010). One of the most frequently used definitions of this concept was proposed by Newman and Girvan (2004): “In an arbitrary network a community is a set of nodes where the density of connections between the nodes of the community is greater than the density of connections between communities”. Community detection is a diverse field with many applications in various fields of science. Newman’s definition has been extended or replaced by newly proposed algorithms, yet it is still the most intuitive way of describing communities.

The majority of community detection algorithms consider communities as disjoint sets of nodes. Popular approaches include modularity maximization methods (Newman and Girvan 2004; Blondel et al. 2008), information theoretic approaches (Rosvall and Bergstrom 2008), statistical inference (Peixoto 2014; Aicher et al. 2014) and spectral techniques (Krzakala et al. 2013; Newman 2013). Other algorithms allow overlaps between neighboring groups (Bóta and Krész 2015; Lancichinetti et al. 2009; Wu et al. 2012). Community detection algorithms can also be applied to weighted networks, where a value on each edge represents similarity (or distance) between nodes (Bóta and Krész 2015; Aicher et al. 2014). All weighted networks can be transformed into unweighted forms by setting a threshold and omitting edges with weights below the threshold. This technique has been used in real-world applications (Palla et al. 2005; Bóta and Kovács 2014). A selection of excellent reviews on community detection can be found in Fortunato (2010), Xie et al. (2013), and Leskovec et al. (2010).

In this work we propose two novel network structures, namely the transfer network and the community network. The transfer network captures the movement patterns of atomic passenger groups, i.e., groups of people who travel together on one or more vehicles, mathematically defined as maximal complete subgraphs of the contact network. The community network captures the similarity of travel patterns between passengers, is derived directly from the transfer network, and is constructed using a novel link-based metric called the connection strength. The community network is intended to reveal groups of travelers who may also be more likely to come into contact outside of their daily travel routine, e.g., colleagues traveling to work or children going to school (Yuan and Gay 2006; Chin et al. 2012). A description of each proposed network structure as well as the contact network can be found in Table 1.

Table 1 Network definitions used in this paper, including the contact network and the proposed transfer and community networks

We further demonstrate potential applications of each of these novel network structures. First, the transfer network is implemented to detect the most frequent vehicle trip combinations in the system (bus and train transfers). While there are existing methods in the literature to measure the capacity of public transit systems (Manuel et al. 2006; Wen-Tai and Ching-Fu 2011), our approach provides a more refined view of passenger movement between vehicle trips by tracking the movements of groups of passengers traveling together. Identifying the most relevant vehicle trip combinations can aid public transit authorities in timetable planning and optimizing vehicle assignments.

Second, the community network is applied to evaluate a diffusion process in a transit network, specifically infectious disease spread among passengers. This application complements our previous work (Bóta et al. 2017a), which aimed to identify the components of the public transportation system most vulnerable to a bio-security threat. The community network proposed here can further improve the performance of such models due to its novel representation of passenger interactions. We use the proposed network structure to simulate how a disease might spread among the vehicles of the public transportation system and the suburbs of the city, and identify the vehicle trips most likely to carry infected passengers. All applications are evaluated on a real-world case study, the public transit system of the Twin Cities, MN, using output from an activity-based travel demand model (Khani et al. 2015).

The rest of the paper is structured as follows. In Section 2 we introduce the travel demand model and how it is used to create the passenger contact network. In Section 3 we introduce the transfer network and illustrate how it can be used to detect frequent vehicle trip combinations. In Section 4 we introduce the community network, including the connection strength value, and illustrate how it can be used to model infectious disease spread. Section 5 contains our conclusions and future research directions.

2 Model Inputs

The inputs of our work are based on the transit assignment model published in Khani et al. (2015). In this section, we present a description of the assignment model and provide a short summary of the properties of the passenger contact network built from it.

2.1 Travel Demand Model

As described in our previous work (Bóta et al. 2017a), public transportation data in this study were obtained from the transit system in the Twin Cities region in Minnesota, where 187 routes serve 13,700 stops. Transit network and schedule data were created from the General Transit Feed Specification (GTFS), including nearly 0.5 million stop-times on a weekday in 2015. Transit passenger trips were obtained from the Metropolitan Council’s activity-based demand model (Cambridge Systematics Inc. 2015), and contained more than 293,000 linked trips (i.e., a passenger trip from an origin to a destination that may include zero or more transfers). The assignment of transit demand to the transit network was done using the FAST-TrIPs model (Khani et al. 2015). FAST-TrIPs is a schedule-based transit assignment model that generates hyperpaths using a defined logit route choice model (Khani 2013), assigns individual passengers to the paths using the hyperpath probabilities, and simulates them using a mesoscopic transit passenger simulation module. Since a calibrated transit route choice model was not available for the Twin Cities network, a route choice model from Austin, TX was borrowed from a previous study by the authors (Khani et al. 2014). The model specifies the following route choice utility function:

$$ u = t_{IV} + 1.77t_{WT} + 3.93t_{WK} + 47.73X_{TR} $$

where \(u\), \(t_{IV}\), \(t_{WT}\), \(t_{WK}\) and \(X_{TR}\) represent path utility, in-vehicle time, waiting time, walking time, and number of transfers in a transit path, respectively. The output of the transit assignment model contains individual passengers’ trajectories, including their walking from the origin to the transit stop, boarding the transit vehicle, alighting and walking to the transfer stop, boarding the next transit vehicle, etc., and finally alighting and walking to the destination. By post-processing the transit assignment model’s outputs, the amount of time each pair of passengers spent on board the same transit vehicle was calculated on a daily basis.
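To make the weighting in the utility function concrete, the following minimal Python sketch evaluates it for two made-up paths. The function name `path_utility`, the input values, and the assumption that all times share one unit (for example minutes) are illustrative only; the source does not state the unit or the sign convention used inside the logit model.

```python
def path_utility(t_iv, t_wt, t_wk, n_transfers):
    """Evaluate u = t_IV + 1.77 t_WT + 3.93 t_WK + 47.73 X_TR.

    All times are assumed to be in the same unit (e.g. minutes);
    this is an assumption, the source does not specify it.
    """
    return t_iv + 1.77 * t_wt + 3.93 * t_wk + 47.73 * n_transfers


# Hypothetical paths: a slower direct ride versus a faster ride with one transfer.
direct = path_utility(t_iv=40.0, t_wt=5.0, t_wk=5.0, n_transfers=0)
one_transfer = path_utility(t_iv=25.0, t_wt=8.0, t_wk=5.0, n_transfers=1)
print(direct, one_transfer)
```

With these coefficients a single transfer weighs as much as 47.73 units of in-vehicle time, so the path with a transfer ends up with the larger value despite its shorter ride.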

2.2 Contact Network

We define a network structure denoted as the contact network based on the outputs of the travel demand model. The nodes of the contact network are passengers, and edges connect passengers if they were traveling on the same vehicle at the same time. The relationship is symmetric, therefore the network is undirected. All passenger movements take place in a temporal setting, which is indicated by two values assigned to the edges of the network: the contact start time on an edge indicates the start time of the contact between the two connected passengers, while a contact duration value indicates the length of the contact in minutes. Since each edge represents a connection between a pair of passengers traveling on a specific vehicle, the id of the vehicle can also be assigned to the edges of the graph. The vehicle id identifies a vehicle trip: a single vehicle traveling on a specific route with a specific start time. The basic properties of the network were shown in Bóta et al. (2017a) along with a detailed analysis regarding the network’s potential role in the spreading of epidemics. A short summary of the network description from Bóta et al. (2017a) follows.
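The edge attributes described above can be illustrated with a short Python sketch. It assumes a hypothetical record layout `(passenger, trip_id, board_min, alight_min)` with times in minutes after midnight, and that a passenger rides a given vehicle trip at most once; neither the layout nor the function name `build_contacts` comes from the source.

```python
from itertools import combinations


def build_contacts(rides):
    """Return contact edges (p, q, trip_id, start, duration) from ride records.

    rides: iterable of (passenger, trip_id, board_min, alight_min);
    this layout is a hypothetical stand-in for the assignment model output.
    """
    by_trip = {}
    for passenger, trip, board, alight in rides:
        by_trip.setdefault(trip, []).append((passenger, board, alight))

    contacts = []
    for trip, riders in by_trip.items():
        for (p, b1, a1), (q, b2, a2) in combinations(riders, 2):
            # Two passengers are in contact while both are on board.
            start, end = max(b1, b2), min(a1, a2)
            if end > start:
                contacts.append((p, q, trip, start, end - start))
    return contacts


rides = [
    ("p1", "route5_0700", 420, 450),
    ("p2", "route5_0700", 430, 460),
    ("p3", "route5_0700", 455, 470),
]
contacts = build_contacts(rides)
print(contacts)
```

In this toy input p1 and p3 never overlap on the vehicle, so no edge is created between them even though they used the same vehicle trip.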

The contact network corresponding to the Twin Cities dataset has 94,475 nodes and 6,287,847 contacts between them. Figure 1a shows the density plot of the contact start times. The distribution has two peaks, one around 7 AM and another around 5 PM, which correspond to the morning and evening weekday commute. The average number of contacts per person is 136 during the observed day, while the maximum is 827. Figure 1b shows the degree distribution of the graph.

Fig. 1 The distribution of (a) contact start times and (b) the degree distribution of the contact network

Figure 2 shows a subgraph of the network structure. Nodes represent passengers and edges connect them if they ride on the same vehicle trip. The dynamics of the network are omitted in this example, i.e. edges are aggregated over the entire day. The black node in the middle represents a passenger with a high number of contacts, who traveled together with all the other nodes shown on this subgraph. The other nodes are colored according to the vehicle trip they rode on, while darker edges indicate longer contact durations.

Fig. 2 A static subgraph of the contact network. A passenger with a high number of contacts is represented by the black node in the middle, connected to all other contacts. The colors indicate the vehicle trips the passengers first met on. The figure was made with Gephi using the Fruchterman-Reingold algorithm

3 Transfer Network

The first objective of this paper is to detect the movement patterns of passenger groups.

To do this, we define a novel network structure denoted as the transfer network. The transfer network is a directed network, where nodes represent the atomic passenger groups of the contact network and two groups are connected if at least one passenger belonging to both transfers from one vehicle trip to the other. The weight of an edge denotes the number of transferring passengers.

In order to build the transfer network we identify the subgraph corresponding to each vehicle trip, detect atomic passenger groups (defined as maximal cliques) on each of the resulting subgraphs of the contact network, and connect the atomic passenger groups according to the direction of transfer between vehicle trips. We weight the connections based on the number of transferring passengers. Section 3.1 defines the construction of the transfer network in more detail.

The transfer network proposed in this paper provides a refined way to identify the most frequent vehicle trip combinations in the public transit system, which can aid decision makers in defining timetables and optimizing vehicle assignments. Section 3.2 discusses this application.

3.1 Transfer Network Construction

The three steps required to build the transfer network are discussed in the subsequent subsections. Section 3.1.1 defines how the subgraphs corresponding to the vehicle trips are constructed, Section 3.1.2 defines atomic passenger groups as cliques and shows how they can be detected efficiently, and Section 3.1.3 shows how the transfer network can be built from the passenger groups.

3.1.1 Graph Partitioning

We define the subgraph corresponding to each vehicle trip as follows. Let \(G\) be the original contact network, \(V(G)\) the vertices and \(E(G)\) the edges of the network. We divide the original network into subgraphs along vehicle trips. Let \(T\) be the set of all vehicle trips in the network and let \(t_{i} \in T\) denote the i-th trip. We define \(G_{t_{i}}\) as the subgraph corresponding to trip \(t_{i}\), where \(V(G_{t_{i}})\) and \(E(G_{t_{i}})\) are the vertices and the edges of \(G_{t_{i}}\). We create a subgraph \(G_{t_{i}}\) for all trips \(t_{i} \in T\). Since a single passenger may travel on multiple trips, the node corresponding to the passenger may appear in multiple subgraphs. Figure 3 shows an example of the partitioning process.

Fig. 3 Partitioning of an example graph along vehicle trips. Each vehicle trip has a corresponding subgraph where nodes are passengers who used the given vehicle trip
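The partitioning step can be sketched in a few lines of Python. The sketch assumes contact edges are given as tuples whose first three fields are two passenger ids and a vehicle trip id; this layout and the name `partition_by_trip` are illustrative assumptions, not taken from the source.

```python
from collections import defaultdict


def partition_by_trip(contacts):
    """Split contact edges into one adjacency dict per vehicle trip.

    contacts: iterable of tuples starting with (p, q, trip_id); any
    further fields (start time, duration) are ignored here.
    """
    subgraphs = defaultdict(lambda: defaultdict(set))
    for p, q, trip, *_ in contacts:
        subgraphs[trip][p].add(q)
        subgraphs[trip][q].add(p)
    return {trip: dict(adj) for trip, adj in subgraphs.items()}


contacts = [
    ("a", "b", "t1"),
    ("b", "c", "t1"),
    ("b", "c", "t2"),
    ("c", "d", "t2"),
]
subgraphs = partition_by_trip(contacts)
print(subgraphs)
```

Passengers b and c rode both trips in this toy input, so they appear in both subgraphs, matching the remark that a node may occur in multiple subgraphs.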

3.1.2 Clique Detection

In graph theory, a clique is defined as a fully connected subgraph of a given graph. A maximal clique is a clique that is not a subgraph of any other clique. Finding the set of all maximal cliques is a well-studied NP-hard problem in graph theory. An arbitrary n-vertex graph may have up to \(3^{n/3}\) maximal cliques, but this number is much lower in many complex networks, including the contact network studied in this paper. Among the available methods in clique detection we adopt the Bron-Kerbosch (BK) algorithm (Bron and Kerbosch 1973), which has proven its reliability in real-life applications (Eppstein et al. 2010). We apply the BK algorithm to detect the maximal cliques of all subgraphs \(G_{t_{i}}\) corresponding to all vehicle trips \(t_{i} \in T\).
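For readers unfamiliar with the algorithm, a compact Python sketch of Bron-Kerbosch with pivoting is given below. It is a generic textbook variant operating on an adjacency dict without self-loops; the implementation actually used in this study is not specified in the text, so the code should be read as an illustration only.

```python
def bron_kerbosch(adj):
    """Yield all maximal cliques of an undirected graph.

    adj: dict mapping each node to the set of its neighbours
    (no self-loops). Each clique is yielded as a frozenset.
    """
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)  # r cannot be extended: maximal clique
            return
        # Pivot on the vertex with most neighbours in p to prune branches.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


# Toy subgraph of one vehicle trip: a triangle a-b-c plus the edge c-d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
cliques = list(bron_kerbosch(adj))
print(cliques)
```

Applied to every subgraph \(G_{t_{i}}\), each yielded set corresponds to one atomic passenger group in the sense used in this section.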

The analysis can be further extended if we consider the graphs as weighted according to the contact durations available on the edges. Introducing a weight threshold \(\tau\), we can prune the edges of the graphs by omitting all edges with contact durations below \(\tau\). This allows us to redefine what a connection means in the network: passengers are only considered to be connected if they spend at least a certain amount of time on the same vehicle. We define three thresholds to represent the strength of connection between passengers: \(\tau_{5} = 5\) minutes, \(\tau_{15} = 15\) minutes and \(\tau_{30} = 30\) minutes. We chose these thresholds to represent short, medium and long duration contacts between passengers. We prune the edges of the contact network by these values, resulting in graphs \(G_{5}\), \(G_{15}\) and \(G_{30}\). In addition, let \(G_{0}\) and \(\tau_{0}\) represent the original (uncut) graph and the corresponding threshold. We run the Bron-Kerbosch clique detection algorithm on these graphs and compare the results. The detection algorithm runs faster on the partitioned network because all subgraphs corresponding to vehicle trips are interval graphs. In order to show this speedup, we also run the clique detection algorithm on the original contact network and compare results. Table 2 shows graph size, runtime of the clique detection method on the original contact network and the total runtime of the detection method on all subgraphs for \(G_{0}\), \(G_{5}\), \(G_{15}\) and \(G_{30}\). We can see a significant speedup for all graphs in this analysis. For the larger \(G_{0}\) and \(G_{5}\) graphs the total runtime on all subgraphs corresponding to vehicle trips is 14 times shorter than on the unpartitioned network.

Table 2 Runtimes of the Bron-Kerbosch algorithm on graphs \(G_{0}\), \(G_{5}\), \(G_{15}\) and \(G_{30}\). Raw denotes runtime in seconds on the original contact network, while partitioned denotes the total runtime on all subgraphs corresponding to vehicle trips
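The threshold-based pruning described in this subsection amounts to a simple filter on contact duration. The Python sketch below assumes contact edges stored as `(p, q, trip_id, start, duration)` tuples with duration in minutes; the tuple layout, the sample data and the name `prune_by_duration` are hypothetical.

```python
def prune_by_duration(contacts, tau):
    """Keep only contacts lasting at least tau minutes.

    contacts: list of (p, q, trip_id, start, duration) tuples
    (an assumed layout, not the authors' data format).
    """
    return [c for c in contacts if c[4] >= tau]


contacts = [
    ("a", "b", "t1", 420, 3),
    ("a", "c", "t1", 421, 5),
    ("b", "c", "t1", 425, 20),
    ("c", "d", "t2", 480, 45),
]
# One pruned edge list per threshold, mirroring G0, G5, G15 and G30.
graphs = {tau: prune_by_duration(contacts, tau) for tau in (0, 5, 15, 30)}
print({tau: len(edges) for tau, edges in graphs.items()})
```

Because edges with durations below the threshold are omitted, a contact of exactly 5 minutes is kept at the 5 minute threshold in this sketch.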

3.1.3 Graph Building

Let \(F\) denote the transfer network as a directed graph where every node \(v \in V(F)\) is a clique from \(G_{t_{i}}\) for all \(t_{i}\). Edges connect nodes \(v\) and \(u\) if the corresponding cliques do not lie in the same vehicle trip, yet they have at least one common passenger. More formally, for nodes \(u\) and \(v\) in the transfer network and their corresponding cliques \(c_{v}\) and \(c_{u}\) in the contact network, \(u\) and \(v\) are connected if \(|c_{v} \cap c_{u}| \geq 1\) and if \(c_{v} \subseteq G_{t_{v}}, c_{u} \subseteq G_{t_{u}}\) then \(G_{t_{v}} \neq G_{t_{u}}\).

The direction of an edge corresponds to the direction of the transfer between \(t_{u}\) and \(t_{v}\), that is, whether the passengers of \(c_{v} \cap c_{u}\) move from \(t_{u}\) to \(t_{v}\) or the opposite. We establish the direction by looking at the contact start times for the individuals in \(c_{v} \cap c_{u}\) in the following way. For all edges of the transfer network \(e_{uv} \in E(F)\), let \(c_{uv}\) denote the set of corresponding passengers in \(G\) as \(c_{uv}=c_{u} \cap c_{v}, c_{v} \subseteq G_{t_{v}}, c_{u} \subseteq G_{t_{u}}\), \(G_{t_{v}} \neq G_{t_{u}}\). Let \(\alpha _{xy}^{t_{i}}\) denote the contact start time between passengers \(x\) and \(y\) on vehicle trip \(t_{i}\). For all passengers \(p,q \in c_{uv}\), if \(\min(\alpha _{pq}^{t_{u}}) < \min(\alpha _{pq}^{t_{v}})\), then the direction of the edge \(e_{uv} \in E(F)\) is from \(u\) to \(v\), else it is from \(v\) to \(u\). If \(|c_{v} \cap c_{u}| = 1\), then let \(c_{v} \cap c_{u} = \{p_{0}\}\) and \(p \in c_{u} \setminus \{p_{0}\}, q \in c_{v} \setminus \{p_{0}\}\). Then \(\min_{p}(\alpha _{p_{0}p}^{t_{u}}) < \min_{q}(\alpha _{p_{0}q}^{t_{v}})\) decides the direction of the edge as above. We assign integer values \(w_{v} = |c_{v}|\) to all \(v \in V(F)\) and \(w_{u,v} = |c_{v} \cap c_{u}|\) for all \(e(u,v) \in E(F)\). Values \(w_{v}\) and \(w_{u,v}\) represent the number of passengers corresponding to the nodes and edges of the transfer network, respectively.
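The construction above can be sketched in Python as follows. To keep the example short, the pairwise contact start times \(\alpha\) are replaced by a hypothetical per-passenger table `start_time[(passenger, trip)]` holding the earliest contact start of that passenger on that trip; this simplification, the data layout and the function name are assumptions made for illustration.

```python
from itertools import combinations


def build_transfer_network(cliques_by_trip, start_time):
    """Build node and edge weights of a transfer network.

    cliques_by_trip: dict trip_id -> list of frozensets of passengers.
    start_time: dict (passenger, trip_id) -> earliest contact start
    (a simplifying assumption replacing pairwise start times).
    Nodes are identified by (trip_id, index of clique within the trip).
    """
    nodes = [(trip, i, clique)
             for trip, cliques in cliques_by_trip.items()
             for i, clique in enumerate(cliques)]
    node_weight = {(trip, i): len(clique) for trip, i, clique in nodes}

    edge_weight = {}
    for (tu, iu, cu), (tv, iv, cv) in combinations(nodes, 2):
        shared = cu & cv
        if tu == tv or not shared:
            continue  # same vehicle trip, or no common passenger
        first_u = min(start_time[(p, tu)] for p in shared)
        first_v = min(start_time[(p, tv)] for p in shared)
        # The edge points from the trip the shared passengers rode first.
        if first_u < first_v:
            edge_weight[((tu, iu), (tv, iv))] = len(shared)
        else:
            edge_weight[((tv, iv), (tu, iu))] = len(shared)
    return node_weight, edge_weight


cliques_by_trip = {
    "A": [frozenset({1, 2, 3})],
    "B": [frozenset({2, 3, 4})],
}
start_time = {(1, "A"): 10, (2, "A"): 10, (3, "A"): 12,
              (2, "B"): 30, (3, "B"): 30, (4, "B"): 28}
node_weight, edge_weight = build_transfer_network(cliques_by_trip, start_time)
print(node_weight, edge_weight)
```

Passengers 2 and 3 ride trip A before trip B in this toy input, so the single edge runs from the clique on A to the clique on B with weight 2.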

3.2 Detecting Frequent Vehicle Trip Combinations

The transfer network allows us to identify the most frequent vehicle trip combinations passengers travel on in the public transportation system. Detecting the most frequent combinations helps decision makers in defining timetables and optimizing vehicle assignments.

As before, let \(T\) be the set of vehicle trips, and \(t_{i} \in T\) the i-th vehicle trip. We define vehicle trip pairs as follows: for all \(t_{i}\) and \(t_{j}\), \(t_{ij}\) is a vehicle trip pair where \(i \neq j\) and \(t_{i},t_{j} \in T\), while the set of all vehicle trip pairs is denoted by \(T_{p}\). We assign a number \(m_{ij}\) to each vehicle trip pair \(t_{ij} \in T_{p}\) indicating the amount of passenger traffic between the corresponding trips. To calculate this value let the set \(V(C_{t_{ij}})\) contain all nodes of the transfer network corresponding to all cliques \(c_{t_{ij}} \in C_{t_{ij}}\) where \(c_{t_{ij}} \in G_{t_{i}} \cup G_{t_{j}}\), and take the subgraph induced by \(V(C_{t_{ij}})\). The number \(m_{ij}\) is the sum of all edge weights of the induced subgraph. This approach offers a more refined view of passenger traffic between vehicle trips because it represents the movements of passenger groups as opposed to the behavior of individual passengers.
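Since transfer network edges only connect cliques on different vehicle trips, the sum over the induced subgraph reduces to adding up the weights of all edges running between the two trips. The Python sketch below assumes edges keyed by `((trip, clique_index), (trip, clique_index))` and treats a trip pair as unordered; both choices, like the sample data, are illustrative assumptions.

```python
from collections import defaultdict


def trip_pair_traffic(edge_weight):
    """Sum transfer network edge weights per (unordered) vehicle trip pair.

    edge_weight: dict ((trip_u, idx_u), (trip_v, idx_v)) -> weight,
    an assumed representation of the transfer network edges.
    """
    traffic = defaultdict(int)
    for ((trip_u, _), (trip_v, _)), weight in edge_weight.items():
        traffic[frozenset((trip_u, trip_v))] += weight
    return dict(traffic)


edge_weight = {
    (("A", 0), ("B", 0)): 2,
    (("A", 1), ("B", 0)): 1,
    (("B", 0), ("C", 0)): 3,
}
traffic = trip_pair_traffic(edge_weight)
# Rank trip pairs by passenger traffic, largest first.
ranking = sorted(traffic.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

A passenger who belongs to several cliques on the same trip contributes to several edges, so the resulting value reflects group movements rather than a plain head count.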

To give another dimension to our analysis we compared the frequent vehicle trip combinations across \(G_{0}\), \(G_{5}\), \(G_{15}\) and \(G_{30}\), that is, considering passengers to be in contact with each other only if they travel together for at least \(\tau_{5} = 5\) minutes, \(\tau_{15} = 15\) minutes and \(\tau_{30} = 30\) minutes, in addition to the unfiltered network \(G_{0}\). Table 3 shows the five most frequent vehicle trip combinations \(t_{ij}\) and the amount \(m_{ij}\) of passenger traffic between them.

Table 3 The five most frequent vehicle trip combinations in \(G_{0}\), \(G_{5}\), \(G_{15}\) and \(G_{30}\). The first number indicates the route number, followed by the start time of the specific vehicle trip

Almost all trip combinations in Table 3 follow a similar pattern. Results for \(G_{0}\) and \(G_{5}\) are nearly identical. The most frequent combinations for \(G_{15}\) include two additional suburbs (Oakdale and Stillwater) compared with \(G_{0}\) and \(G_{5}\), but otherwise are identical. The vehicle trip pairs with the greatest amount of passenger traffic between them are mostly in the morning or afternoon peak hours. Almost all of the trip combinations link one of the outlying suburbs to the city center, indicating the daily commuting patterns of workers or students. The urban area of the Twin Cities alone covers 2646 km² with a population of more than three million. This means that commuters trying to reach the city center from one of the outlying suburbs must travel on two or sometimes three separate routes to get to their destination. The pattern, also shown in Fig. 4, is the following. Commuters start the journey from one of the smaller outlying suburbs like Ridgedale (route 614), Inver Grove and other suburbs south of St. Paul (route 68) or Oakdale and Stillwater (route 294). Then they change vehicles in the transport hub of one of the major outlying suburbs (St. Paul, Minnetonka, Mall of America in Bloomington), and take an express service to the city center (routes 94, 675, 850, etc.).

Fig. 4 Most frequent vehicle trip combinations in Twin Cities, MN for \(G_{0}\), \(G_{5}\), \(G_{15}\) and \(G_{30}\)

The frequent combinations for \(G_{30}\) are similar but not identical. Although the suburbs these routes connect are different, we see almost the same pattern as before: travel from one of the smaller outlying suburbs to a major one and then to the city center (combinations 71–61 and 5–901). Route 850 is one of the longest express bus routes in the city and includes a long section where the bus does not stop at all. Exceptions to the pattern are route 54, which connects St. Paul International Airport to St. Paul, and route 68, which connects to south St. Paul. In terms of time, we see late night services in addition to peak hours.

In summary, graphs \(G_{0}\), \(G_{5}\) and \(G_{15}\) behave similarly, showing the movements of passenger groups traveling through the public transit system. The most frequent trip combinations are the ones connecting distant suburbs to the city center during the morning and afternoon peak hours. \(G_{30}\) captures an alternative set of routes corresponding to people who travel together for longer periods for different reasons, like a long service (route 850), the scarcity of services late at night (route 5 at 0:30) or traveling from the airport to a major transport hub (route 54).

4 Community Network

Next we propose a novel network structure, the community network, which expands the definition of passenger connectivity to be a function of both the number of transfers passengers make together and the total amount of time they spend together while traveling. This contrasts with the contact network, which simply quantifies passenger connectivity using contact duration on individual vehicle trips. We define a community of passengers as a set of passengers who have common travel patterns, e.g., vehicle trips and/or transfers. In order to build communities, we define and quantify a novel connection strength metric between passengers indicating the similarity of their travel patterns, and create a new network structure, the community network, based on this value. Section 4.1 outlines this process in more detail.

This network can serve as the basis of a community detection algorithm, but in this paper we take a different approach. We define communities as sets of passengers connected by edges whose values lie above a predetermined threshold. In Section 4.2 we demonstrate this feature by identifying the commuting patterns of the members of the largest passenger community of the network.

We propose an application of the community network in Section 4.3. We show how the connection strength value can be used to model infectious disease transmission among passengers of a public transit system. Building on our previous work in Bóta et al. (2017a), we seek to identify the vehicle trips most likely to carry infected passengers during an outbreak.

4.1 Community Network Construction

In order to detect the communities of the public transportation network we construct a weighted network structure called the community network. The community network connects passengers using a novel link-based metric, the connection strength. The connection strength \(s\) defines the edge weights in the community network, and takes into account the number of transfers a pair of passengers makes together. Thus, if the passengers meet on multiple different vehicle trips, their connection strength \(s\) will increase. This method is based on the assumption that passengers who not only travel together but also transfer together have a stronger connection than travelers who are simply present on the same vehicle trip at the same time. Below we explain how this link metric is derived.

Let \(H\) denote the new network where the nodes are the passengers, and the set of nodes \(V(H)\) includes all passengers who traveled on at least two vehicle trips with another passenger. Thus, the nodes in the community network correspond to the edges of the transfer network. We define the connections and weights between passengers as follows. Nodes \(u\) and \(v\) in the community network are connected if they are both present in at least two different cliques \(c_{1},c_{2}\) in two different subgraphs corresponding to vehicle trips, i.e., they traveled together on at least two vehicle trips (\(u,v \in c_{1} \subseteq G_{t_{1}}\) and \(u,v \in c_{2} \subseteq G_{t_{2}}\)). Let \(g_{uv}\) denote the number of instances where \(u\) and \(v\) are members of the same clique, that is \(g_{uv} = |C_{uv}|\) where \(c_{uv} \in C_{uv}\) if \(u,v \in c_{uv}\). Let \(T_{uv}\) be the set of vehicle trips where both \(u\) and \(v\) are present: \(t_{uv} \in T_{uv}\) if \(u,v \in t_{uv}\), and let \(g_{uv}^{t_{i}}\) be the number of the cliques in vehicle trip \(t_{i}\) where \(u\) and \(v\) are both present: \(g_{uv}^{t_{i}} = |C_{uv}^{t_{i}}|\) where \(c_{uv}^{t_{i}} \in C_{uv}^{t_{i}}\) if \(u,v \in c_{uv}^{t_{i}} \in G_{t_{i}}\). Thus, the connection strength \(s_{uv}\) between passengers \(u\) and \(v\) can be formalized as follows:

$$ s_{uv}=\frac{g_{uv}*(g_{uv}-1)}{2}-\sum\limits_{t_{i} \in T_{uv}}\frac{g_{uv}^{t_{i}}*(g_{uv}^{t_{i}} -1)}{2} \tag{1} $$

Using this definition of connection strength, if a pair of passengers traveled together in \(g_{uv}\) different atomic passenger groups, then they would have \(\frac {g_{uv}*(g_{uv}-1)}{2}\) different edges between them because the affected nodes form a clique in the transfer network. This way the first part of the equation rewards the movements between different vehicle trips. Since, based on our community definition, traveling on the same vehicle trip does not indicate a strong connection between the passengers, in the second part of the equation we penalize any instance where \(u\) and \(v\) travel together on the same vehicle trip. The value of the penalty is the sum of the edges in every vehicle trip where the passenger pair appears more than once, and the number of edges in a vehicle trip is counted in the same way as in the first part of the equation.

Algorithm 1 Construction of the community network

Algorithm 1 shows the construction of the community network, while Fig. 5 illustrates a few examples for computing \(s\) between passengers. In Fig. 5a two passengers travel together in two cliques on vehicle trip \(t_{1}\) and one clique on vehicle trip \(t_{2}\), therefore \( g_{uv}^{t_{1}} = 2, \ g_{uv}^{t_{2}} = 1, \ g_{uv}=3\) and \(s = 2\). A different situation is shown in Fig. 5b, where two passengers travel together in three cliques on \(t_{3}\) and two cliques on \(t_{4}\), making \(g_{uv}^{t_{3}} = 3, \ g_{uv}^{t_{4}} = 2, \ g_{uv}=5\) and \(s = 6\). Figure 5c presents a trivial case, where two passengers travel together on a single vehicle trip in four cliques, making \( g_{uv}^{t_{1}} = 4, \ g_{uv}=4\) and \(s = 0\).

Fig. 5 Examples of the connection strength between pairs of passengers in three different travel scenarios. Rectangles represent vehicle trips and circles represent cliques. Edges marked with black increase the connection strength between highlighted passengers. Red edges penalize connection strength because these are on the same vehicle trip. (a) two passengers travel together in two cliques on vehicle trip \(t_{1}\) and one clique on vehicle trip \(t_{2}\), therefore \( g_{uv}^{t_{1}} = 2, \ g_{uv}^{t_{2}} = 1, \ g_{uv}=3\) and \(s = 2\). (b) two passengers travel together in three cliques on \(t_{3}\) and two cliques on \(t_{4}\), making \(g_{uv}^{t_{3}} = 3, \ g_{uv}^{t_{4}} = 2, \ g_{uv}=5\) and \(s = 6\). (c) two passengers travel together on a single vehicle trip in four cliques, making \( g_{uv}^{t_{1}} = 4, \ g_{uv}=4\) and \(s = 0\)
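The three scenarios of Fig. 5 can be checked with a short Python sketch of Eq. (1). The input format, a list holding one vehicle trip id per clique that contains both passengers, and the function name `connection_strength` are assumptions chosen for illustration; this is not a transcription of Algorithm 1.

```python
from collections import Counter


def connection_strength(shared_clique_trips):
    """Compute s_uv from Eq. (1).

    shared_clique_trips: list with one trip id for every clique that
    contains both passengers u and v (an assumed input format).
    """
    def pairs(g):
        return g * (g - 1) // 2  # number of unordered pairs among g items

    g_uv = len(shared_clique_trips)
    per_trip = Counter(shared_clique_trips)  # g_uv^{t_i} for each trip
    return pairs(g_uv) - sum(pairs(g) for g in per_trip.values())


s_a = connection_strength(["t1", "t1", "t2"])              # Fig. 5a
s_b = connection_strength(["t3", "t3", "t3", "t4", "t4"])  # Fig. 5b
s_c = connection_strength(["t1", "t1", "t1", "t1"])        # Fig. 5c
print(s_a, s_b, s_c)
```

Subtracting the same-trip pairs from all pairs leaves exactly the pairs of shared cliques that lie on different vehicle trips, which is why the single-trip case of Fig. 5c evaluates to zero.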

4.2 Passenger Communities

In this section we illustrate examples of passenger communities in the public transportation system of the Twin Cities, MN. The first example, shown in Fig. 6, contains subgraphs of the community network constructed from \(G_{30}\), i.e., the contact network only containing edges where contact duration is above 30 minutes. Figure 6a depicts the entire community network. Most of the communities in this network are of size two or three, but there are several larger communities with strong connections between the members. Figure 6b shows a subgraph where edges with weights \(s < 5\) are omitted as well as all nodes with degrees below two. The remaining subgraph contains the largest group of the network, while the largest individual community is depicted in Fig. 6c.

Fig. 6 The community network of Twin Cities, MN. (a) the whole community network, (b) a subgraph with edge weights greater than 5, (c) the largest passenger group of the network
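The threshold-based community definition used here can be sketched as an edge filter followed by a connected component search. The Python example below assumes connection strengths stored in a dict keyed by passenger pairs and keeps edges with \(s\) at or above the threshold; the additional removal of nodes with degree below two is left out for brevity, and all names and values are hypothetical.

```python
def communities(strength, threshold):
    """Return connected components of the thresholded community network.

    strength: dict (u, v) -> s_uv (assumed representation).
    Edges with s_uv below threshold are dropped; components are
    returned as sets, largest first.
    """
    adj = {}
    for (u, v), s in strength.items():
        if s >= threshold:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)

    seen, groups = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adj[node] - group)
        seen |= group
        groups.append(group)
    return sorted(groups, key=len, reverse=True)


strength = {("a", "b"): 6, ("b", "c"): 5, ("c", "d"): 1, ("e", "f"): 7}
groups = communities(strength, threshold=5)
print(groups)
```

Passenger d is linked to c only through a weak edge, so d drops out and the largest remaining group consists of a, b and c.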

Figure 6c shows the largest group in the network. The group contains nine passengers who traveled together on two different vehicle trips, while the overall time they spent using the public transportation network was almost 1.5 hours. The passengers boarded route 805 in the morning between 7:08 and 7:16 near Blaine and traveled together to Northtown. They disembarked at 7:48, waited together for the second vehicle (852) arriving at 8:12, and traveled together to downtown Minneapolis for almost an hour, disembarking between 9:12 and 9:16. The travel path of the community, shown in Fig. 7, indicates a commuting pattern from one of the suburbs to the city center of Minneapolis.

Fig. 7 The travel path of the passenger community in Fig. 6c traveling from a suburb to the city center

Both the contact and the community networks reveal contact patterns between passengers of a public transportation system. In the contact network, the weight between individual passengers is defined based on the contact duration on individual vehicle trips. In contrast, the community network provides a more refined way to represent connection strength, which takes into account the number of transfers passengers make together.

The underlying concept behind community structure in networks is homophily, which is a well-studied concept in the social sciences (Eagle et al. 2009; Yuan and Gay 2006; Chin et al. 2012). Homophily states that people tend to form groups according to their lines of interest, occupation, etc. Physical proximity, for example on public transportation, is one of these indicators. Therefore, strong connections in the community network can help us uncover connections in other areas of life like workplace, school or other common interests. It should be noted that physical proximity does not guarantee another type of connection; it simply increases the likelihood of one.

4.3 Epidemic Spreading Risk Application

One application of identifying the communities within the transit network, as described in this work, is infrastructure security. Understanding passenger communities enables more efficient and accurate tracking of infectious disease spread, were a disease to be naturally or maliciously introduced into the public transit system. One of the challenges in modeling epidemic spreading is accurately mapping the relationships between individuals traveling on the same vehicle. A traditional contact network, as defined in Section 2 as well as in Bóta et al. (2017a, 2017b) and Sun et al. (2013), simply reveals the set of passengers who were present on the same vehicle and the amount of time they spent together on it. As shown in Bóta et al. (2017b), this restricts infection spreading probabilities to being a function of contact duration alone, without the possibility of taking into account physical proximity, communication, etc. between people.

In contrast, the weights of the community network reveal a deeper level of connection between travelers, as passengers with strong connections in this network may also be connected in other areas of life. Passenger groups identified in the community network are more likely to be traveling in close physical proximity to each other and interacting with each other (Eagle et al. 2009; Yuan and Gay 2006; Chin et al. 2012). As a consequence, the probability of disease transmission between travelers belonging to the same community is greater than between two passengers who simply share the same vehicle without any other connection. This is especially true for large public transportation vehicles like trains or trams, where simply being present on the same vehicle may not imply any kind of connection at all.

In this section we expand upon our previous work in Bóta et al. (2017a), which examined epidemic spreading risk in the same public transportation network (Twin Cities, MN) with the goal of identifying the vehicle trips most likely to carry infected passengers. The analysis was presented in two parts. First, the passenger contact network, in the same form as in Section 2, was used to model a variety of outbreak scenarios. The scenarios differed in the number and distribution of initially infected passengers and in the level of infectiousness, represented as the risk of spreading between passengers. The spreading risk was defined as the contact duration multiplied by a constant value. The scenarios were compared in terms of their impact on the network and confirmed previous observations in the literature, namely that the most central vehicle trips in the public transportation network are also the most susceptible to infection. The second part of the analysis focused on a newly proposed vehicle trip network, which represents the public transit network as a network of vehicle trips instead of passengers. We showed that centrality metrics on the vehicle trip network provide a more efficient way to detect the set of vehicle trips most susceptible to disease, and that this estimation can be done much faster than running simulations on the contact network.

In the rest of this section we present an alternative way to model the risk of disease spreading between passengers, which exploits the community network structure introduced in this work. We define the infection transmission probability between a pair of passengers to be a function of both the connection strength value used in the community network and contact duration. We believe this multidimensional transmission probability more accurately represents the level of connectivity between pairs of passengers. Using these new transmission functions, we implement spreading scenarios similar to the ones in Bóta et al. (2017a) and identify and rank the vehicle trips susceptible to disease spreading. Below we define the new transmission probability function and the infection model, then present the vehicle trips newly identified to be at highest risk.

4.3.1 Experiment Setup

To simulate an epidemic outbreak on the contact network we use the well-known discrete compartmental susceptible-infected (SI) model. In the SI infection process every node is in one of two states: susceptible (S) or infected (I). Each link e of the network is assigned a real value w_e ∈ [0,1], its edge infection probability. The infection spreads in the network as susceptible nodes become infected. This happens iteratively in discrete time steps, starting from an initially infected set of nodes. In each iteration each infected node tries to infect its susceptible neighbors according to the transmission probability of the link connecting them. If the attempt is successful, the susceptible node becomes infected in the next iteration. In this work we limit the number of iterations to five, representing a complete work week with recurrent commuting patterns. The inputs of this model are: (1) a contact network of individuals, (2) an assignment of weights to the links of the network, which represent the infection transmission probabilities, and (3) the set of initially infected nodes, i.e., individuals.
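The SI process described above can be illustrated with a minimal Python sketch of a single simulation run. This is not the implementation used in the study: the function name `simulate_si`, the representation of the network as a list of `(u, v, w)` triples, and the use of Python's standard `random` module are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def simulate_si(edges, seeds, steps=5, rng=None):
    """Run one realization of a discrete SI process (illustrative sketch).

    edges: iterable of (u, v, w) triples, where w in [0, 1] is the
           infection probability of the undirected link between u and v.
    seeds: iterable of initially infected nodes.
    steps: number of iterations (five in the experiment described above).
    Returns the set of nodes infected after the last iteration.
    """
    rng = rng or random.Random()

    # Build an undirected adjacency list carrying the link probabilities.
    neighbors = defaultdict(list)
    for u, v, w in edges:
        neighbors[u].append((v, w))
        neighbors[v].append((u, w))

    infected = set(seeds)
    for _ in range(steps):
        newly_infected = set()
        # Every infected node tries to infect each susceptible neighbor.
        for u in infected:
            for v, w in neighbors[u]:
                if v not in infected and rng.random() < w:
                    newly_infected.add(v)
        # Successful attempts take effect only in the next iteration.
        infected |= newly_infected
    return infected
```

Collecting successful attempts in `newly_infected` and merging them only at the end of the iteration keeps the update synchronous, so a node infected in a given step cannot spread the disease until the following step.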

We use the original structure of the contact network, which connects all pairs of passengers that travel on a vehicle trip together, as the basis of this experiment. The link weights are computed in the following way. Since the nodes and links of the community network are a subset of the nodes and links of the contact network, whenever a link between two nodes is present in the community network we assign its connection strength value to the corresponding link in the contact network. To account for extreme outliers of connection strength we cap all values at 100, then rescale them to the interval [0.1, 0.8] using a standard feature scaling method. The upper bound of 0.8 is set because a transmission probability of 1.0 is assumed to be unrealistic. All links of the contact network that have no corresponding link in the community network are assigned a uniform infection value of 0.05. In contrast to the duration-based value used in Bóta et al. (2017a), this enables us to capture the increased spreading risk between passengers sharing similar travel patterns, while still allowing disease to spread between travelers simply sharing a vehicle. Future work will explore the model’s sensitivity to the uniform infection value used in this work.
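The weight assignment can be sketched as follows. The text only specifies "a standard feature scaling method", so the min-max rescaling used here is an assumption, as are the function name `transmission_probabilities` and the representation of links as tuples used as dictionary keys.

```python
def transmission_probabilities(contact_links, strength,
                               cap=100.0, low=0.1, high=0.8, default=0.05):
    """Assign an infection probability to every contact-network link (sketch).

    contact_links: iterable of link identifiers, e.g. (u, v) tuples.
    strength: dict mapping community-network links to connection strength.
    Links present in `strength` are capped at `cap` and rescaled to
    [low, high]; min-max scaling is assumed here. All other links
    receive the uniform `default` value.
    """
    # Cap extreme outliers of connection strength.
    capped = {link: min(value, cap) for link, value in strength.items()}
    smallest = min(capped.values()) if capped else 0.0
    largest = max(capped.values()) if capped else 0.0
    span = largest - smallest

    probabilities = {}
    for link in contact_links:
        if link in capped:
            # Min-max rescaling; if all strengths are equal, use the lower bound.
            scaled = (capped[link] - smallest) / span if span > 0 else 0.0
            probabilities[link] = low + scaled * (high - low)
        else:
            probabilities[link] = default
    return probabilities
```

For example, with connection strengths of 2, 51 and 250 on three links, the last value is capped at 100 and the links receive probabilities of approximately 0.1, 0.45 and 0.8, while any contact link without a community counterpart receives 0.05.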

Following the procedure in Bóta et al. (2017a), we randomly select 100 passengers from the network to be initially infected. Due to the probabilistic nature of the simulation model, we run the SI infection model k = 10,000 times to quantify the likelihood of each node being in an infectious state at the end of the fifth time step.
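One straightforward way to turn repeated runs into per-node likelihoods is to take, for each node, the fraction of runs in which it ends up infected. The sketch below assumes that the outcome of each run is available as a set of infected node identifiers; the function name `infection_frequency` is hypothetical, and the text does not state that the likelihood was computed in exactly this way.

```python
from collections import Counter


def infection_frequency(run_outcomes):
    """Estimate per-node infection likelihood from repeated runs (sketch).

    run_outcomes: iterable of collections, one per simulation run, each
                  holding the nodes infected at the end of that run.
    Returns a dict mapping each node that was infected at least once to
    the fraction of runs in which it was infected.
    """
    runs = list(run_outcomes)
    counts = Counter()
    for infected in runs:
        # Use a set so each node is counted at most once per run.
        counts.update(set(infected))
    return {node: count / len(runs) for node, count in counts.items()}
```

Nodes that are never infected do not appear in the result and can be treated as having an estimated likelihood of zero.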

4.3.2 Vehicle Trip Ranking

The infection model constructed above provides the likelihood of infection for each passenger in the contact network at the end of the simulation, i.e., after five days. As in Bóta et al. (2017a), we compute a similar infection value for vehicle trips by summing the probability of infection for all passengers on a given vehicle trip. While this does not represent a probability value in a strict sense, this value is proportional to the risk of getting infected on a given vehicle trip.
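The trip-level score described above amounts to a sum over the passengers of each vehicle trip followed by a sort. A minimal sketch, assuming the passenger infection likelihoods and the passenger lists of each trip are available as dictionaries (the names `rank_vehicle_trips`, `infection_prob` and `trip_passengers` are hypothetical):

```python
def rank_vehicle_trips(infection_prob, trip_passengers, top=None):
    """Rank vehicle trips by summed passenger infection likelihood (sketch).

    infection_prob: dict mapping passenger -> estimated infection likelihood.
    trip_passengers: dict mapping vehicle trip -> iterable of its passengers.
    top: optional number of highest-scoring trips to return.
    Returns a list of (trip, score) pairs sorted by decreasing score.
    """
    scores = {
        # Passengers without an estimate contribute zero to the score.
        trip: sum(infection_prob.get(p, 0.0) for p in passengers)
        for trip, passengers in trip_passengers.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]
```

Because the score is a sum rather than a probability, it can exceed 1 and grows with the number of passengers on the trip, so it is best read as a relative measure for comparing trips.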

The routes containing the highest-risk vehicle trips identified in this study, along with the corresponding travel times, are presented in Table 4. Figure 8 shows the routes on the map of the city. Consistent with the findings in Section 3 and in Bóta et al. (2017a), the vehicle trips fall within the morning and afternoon peak hours. Also consistent with the results in Bóta et al. (2017a), all of the high-risk routes identified pass through the city center.

Table 4 Ranking of vehicle trips where infection is most likely to appear
Fig. 8

Vehicle trips most likely to carry infected passengers in the public transit system of Twin Cities, MN

Since the goal of Section 3.2 (identifying the most frequent trip combinations) and the goal of this section (identifying the vehicle trips at highest risk in an epidemic outbreak) are different, the differences between the selected vehicle trips are not surprising. We nevertheless observe some similarities between the “highest risk” vehicle trips identified here and the most frequent vehicle trip combinations identified in Section 3.2. For example, route 18 connects Bloomington with the city center, while route 61 connects St. Paul to the same destination. We have seen in Section 3.2 that these origin-destination pairs are present in the most frequent combinations as well. As these routes connect the most important parts of the city, some similarity is expected.

The set of high-risk vehicle trips identified here partially corresponds to the set identified in Bóta et al. (2017a). Specifically, the methodology proposed in Bóta et al. (2017a) identifies vehicle trips on routes 7, 9, 11, 14, 17, 25, 61 and 94 as most at risk. These routes cross or connect to the city center in the peak hours, so even though they do not completely match those in Table 4, the pattern they present is the same. The similarity in the set and type of routes identified in both studies points to the critical role of the network structure in modeling outbreak risk. Future research will continue to build on this application and further explore the robustness and sensitivity of the proposed methodology. The relevant sensitivity analysis is, however, outside the scope of this work.

5 Limitations

This study is subject to certain limitations. We note that the transit demand model used as an input in both works generates commuting patterns for a single workday only. This limitation is most critical for the epidemic application, due to the implicit assumption that daily travel patterns remain constant during a five-day work week. While this assumption is supported by a recent study (Sun et al. 2013), longer-term observations, potentially including weekends and public holidays, would improve the quality of the results presented in this paper.

Further, while we present an alternative metric to quantify disease transmission risk between passengers (compared to contact duration alone), validating the results of this work remains a challenge in the absence of real-world observations of an outbreak, and will be the focus of future research. One solution would be to use epidemic data recorded in another major city (Sun et al. 2014), but adapting such data sources to a different environment presents its own set of challenges.

6 Conclusions

To better understand passenger commuting habits in a public transportation system, in this paper we analyzed the contact patterns from a public transit assignment model of a major metropolitan city. We did this by defining two novel network structures to detect and quantify the movement patterns of passengers. The transfer network tracks the movements of atomic passenger groups, while the community network links passengers with similar travel patterns. We presented applications for both networks: identifying the most frequently used travel paths, i.e., routes and transfers, and modeling the epidemic risk posed by passengers of a public transit network, respectively.

Our main findings correspond to existing literature, while providing a more detailed analysis of passenger commuting habits. We have shown that the most frequent vehicle trip combinations follow a similar pattern: the trip pairs identified using the transfer network are peak morning and afternoon trips that connect the outlying suburbs with the CBD. The transfer network was demonstrated to be an efficient tool for detecting the most frequently used vehicle trip combinations in the public transportation system.

Our results also reinforce previous observations in revealing components of the transit system at risk from an epidemic outbreak. For this purpose, we have shown how the connection strength value can be used to estimate physical contact between passengers and to identify the vehicle trips most likely to carry infected passengers: the routes crossing or connecting to the city center in the peak hours.

In the future we plan to extend the epidemic application and more thoroughly investigate how connection strength can be used to improve risk analysis in a public transportation setting.