Discovering the Hidden Community Structure of Public Transportation Networks


Advances in public transit modeling and smart card technologies can reveal detailed contact patterns of passengers. A natural way to represent such contact patterns is in the form of networks. In this paper we utilize known contact patterns from a public transit assignment model in a major metropolitan city, and propose the development of two novel network structures, each of which elucidate certain aspects of passenger travel behavior. We first propose the development of a transfer network, which can reveal passenger groups that travel together on a given day. Second, we propose the development of a community network, which is derived from the transfer network, and captures the similarity of travel patterns among passengers. We then explore the application of each of these network structures to identify the most frequently used travel paths, i.e., routes and transfers, in the public transit system, and model epidemic spreading risk among passengers of a public transit network, respectively. In the latter our conclusions reinforce previous observations, that routes crossing or connecting to the city center in the morning and afternoon peak hours are the most “dangerous” during an outbreak.

Introduction and Background

Networks can be used to represent public transportation systems from various unique perspectives. Traditionally nodes and edges represent the physical infrastructure of a transportation system, e.g., routes and stops/stations (Háznagy et al. 2015), while recent developments in data collection and modeling allow researchers to accurately map contacts between individuals traveling together. Detailed commuting patterns can be recorded from smart card data (Sun et al.2013, 2014; Bao et al. 2017) or can be the output of activity-based travel models (Brockmann et al. 2006; Song et al. 2010; de Montjoye et al. 2013; Chen et al. 2016; Gardner et al. 2012; Saberi et al. 2018). The resulting passenger contact patterns are both spatial and temporal in nature, and include travel times on specific vehicles and contact with other travelers (Ramadurai and Ukkusuri 2010; Illenberger et al. 2012; Khani et al. 2015). Such detailed travel patterns can then be used for various planning objectives including the estimation of the capacity of infrastructure (Wang et al. 2011), calculating environmental impact (Carlsson-Kanyama and Lindén 1999) or designing surveillance and containment strategies during an epidemic outbreak (Pendyala et al. 2012; Rey et al. 2016). While these methods allow researchers to map contacts between known individuals, the data collection and processing required to recreate a real-world contact network presents many challenges in terms of accuracy and computational complexity, among other issues such as privacy (Huerta and Tsimring 2002; Hoogendoorn and Bovy 2005; Balcan 2009; Salathé 2010; Funk et al. 2010; Nassir et al. 2012).

In this work we specifically focus on the community structure of public transit ridership patterns. Observations on social interactions reveal that people tend to form groups according to their lines of interest, occupation, etc. This concept is known as homophily in the social sciences (Eagle et al. 2009; Yuan and Gay 2006; Chin et al. 2012), while in network modeling we refer to this phenomenon as community structure (Fortunato 2010). One of the most frequently used definitions of this concept was proposed by Newman and Girvan (2004): “In an arbitrary network a community is a set of nodes where the density of connections between the nodes of the community is greater than the density of connections between communities”. Community detection is a diverse field with many applications in various fields of science. Newman’s definition has been extended or replaced by newly proposed algorithms, yet it is still the most intuitive way of describing communities.

The majority of community detection algorithms consider communities as disjoint sets of nodes. Popular approaches include modularity maximization methods (Newman and Girvan 2004; Blondel et al. 2008), information theoretic approaches (Rosvall and Bergstrom 2008), statistical inference (Peixoto 2014; Aicher et al. 2014) and spectral techniques (Krzakala et al. 2013; Newman 2013). Other algorithms allow overlaps between neighboring groups (Bóta and Krész 2015; Lancichinetti et al. 2009; Wu et al. 2012). Community detection algorithms can also be applied to weighted networks, where a value on each edge represents similarity (or distance) between nodes (Bóta and Krész 2015; Aicher et al. 2014). All weighted networks can be transformed into unweighted forms by setting a threshold and omitting edges with weights below the threshold. This technique has been used in real-world applications (Palla et al. 2005; Bóta and Kovács 2014). A selection of excellent reviews on community detection can be found in Fortunato (2010), Xie et al. (2013), and Leskovec et al. (2010).

In this work we propose the development of two novel network structures, namely the transfer network, and the community network. The transfer network captures the movement patterns of atomic passenger groups, i.e., groups of people who travel together on one or more vehicles, mathematically defined as maximal complete subgraphs of the contact network. The community network captures the similarity of travel patterns between passengers, is derived directly from the transfer network, and is constructed using a novel link-based metric called the connection strength. The community network is intended to reveal groups of travelers who may also be more likely to come into contact outside of their daily travel routine e.g.: colleagues traveling to work or children going to school (Yuan and Gay 2006; Chin et al. 2012). A description of each proposed network structure as well as the contact network can be found in Table 1.

Table 1 Network definitions used in this paper, including the contact network and the proposed transfer and community networks

We further demonstrate potential applications of each of these novel network structures. First, the transfer network is implemented to detect the most frequent vehicle trip combinations in the system (bus and train transfers). While there are existing methods in the literature to measure the capacity of public transit systems (Manuel et al. 2006; Wen-Tai and Ching-Fu 2011), our approach provides a more refined view of passenger movement between vehicle trips by tracking the movements of groups of passengers traveling together. Identifying the most relevant vehicle trip combinations can aid public transit authorities in timetable planning and optimizing vehicle assignments.

Second, the community network is applied to evaluate a diffusion process in a transit network, specifically infectious disease spread among passengers. This application complements our previous work (Bóta et al. 2017a), to identify the components of the public transportation system most vulnerable to a bio-security threat. The community network proposed here can further improve the performance of such models due to its novel representation of passenger interactions. We use the proposed network structure to simulate how a disease might spread among the vehicles of the public transportation system and the suburbs of the city, and identify the vehicle trips most likely to carry infected passengers. All applications are evaluated on a real-world case study, the public transit system of Twin Cities, MN, using output from an activity based travel demand model (Khani et al. 2015).

The rest of the paper is structured as follows. In Section 2 we introduce the travel demand model, and how it is used to create passenger contact network. In Section 3 we introduce the transfer network, and illustrate how it can be used to detect frequent vehicle trip combinations. In Section 4 we introduce the community network, including the connection strength value, and illustrate how it can be used to model infectious disease spread. Section 5 contains our conclusions and future research directions.

Model Inputs

The inputs of our work are based on the transit assignment model published in Khani et al. (2015). In this section, we present a description of the assignment model and provide a short summary of the properties of the passenger contact network built from it.

Travel Demand Model

As described in our previous work (Bóta et al. 2017a), public transportation data in this study was obtained from the transit system in Twin Cities region in Minnesota, where 187 routes serve 13,700 stops in the region. Transit network and schedule data were created from General Transit Feed Specification (GTFS), including near 0.5 Million stop-times on a weekday in 2015. Transit passenger trips were obtained from Metropolitan Councils’s activity-based demand model (Cambridge Systematics Inc. 2015), and contained more than 293,000 linked trips (i.e. a passenger trip from an origin to a destination that may include zero or more transfers). The assignment of transit demand to transit network was done using the FAST-TrIPs model (Khani et al. 2015). FAST-TrIPs is a schedule-based transit assignment model that generates hyperpaths using a defined logit route choice model (Khani 2013), assigns individual passengers to the paths using the hyperpath probabilities, and simulates them using a mesoscopic transit passenger simulation module. Since a calibrated transit route choice model was not available for the Twin Cities network, a route choice model from Austin, TX was borrowed from a previous study by the authors (Khani et al. 2014). The model specifies the following route choice utility function:

$$ u = t_{IV} + 1.77t_{WT} + 3.93t_{WK} + 47.73X_{TR} $$

where t,tIV,tWT,tWK and XTR represent path utility, in-vehicle time, waiting time, walking time, and number of transfers in a transit path, respectively. The output of the transit assignment model contains individual passengers’ trajectories, including their walking from the origin to the transit stop, boarding the transit vehicle, alighting and walking to the transfer stop, boarding the next transit vehicle, etc. and finally alighting and walking to the destination. By post-processing the transit assignment model’s outputs, the amount of time each pair of passengers are on-board the same transit vehicle were calculated on a daily basis.

Contact Network

We define a network structure denoted as the contact network based on the outputs of the travel demand model. The nodes of the contact network are passengers, and edges connect passengers if they were traveling on the same vehicle at the same time. The relationship is symmetric, therefore the network is undirected. All passenger movements take place in a temporal setting, which is indicated by two values assigned to the edges of the network: the contact start time on an edge indicates the start time of the contact between the two connected passengers, while a contact duration value indicates the length of the contact in minutes. Since each edge represents a connection between a pair of passengers traveling on a specific vehicle, the id of the vehicle can also be assigned to the edges of the graph. The vehicle id identifies a vehicle trip: a single vehicle traveling on a specific route with a specific start time. The basic properties of the network were shown in Bóta et al. (2017a) along with a detailed analysis regarding the networks potential role in the spreading of epidemics. A short review on the description of the network as appeared in Bóta et al. (2017a) follows.

The contact network corresponding to the Twin cities dataset has 94475 nodes and 6287847 contacts between them. Figure 1a shows the density plot of the contact start times. The distribution has two peaks, one around 7 AM and another around 5 PM, which corresponds to the morning and evening weekday commute. The average number of contacts per person is 136 during the observed day, while the maximum is 827. Figure 1b shows the degree distribution of the graph.

Fig. 1

The distribution of a contact start times and b the degree distribution of the contact network

Figure 2 shows a subgraph of the network structure. Nodes represent passengers and edges connect them if they ride on the same vehicle trip. The dynamics of the network are omitted in this example, i.e. edges are aggregated over the entire day. The black node in the middle represents a passenger with a high number of contacts, who traveled together with all the other nodes shown on this subgraph. The other nodes are colored according to the vehicle trip they rode on, while darker edges indicate longer contact durations.

Fig. 2

A static subgraph of the contact network. A passenger with a high number contacts is represented by the black node in the middle, connected to all other contacts. The colors indicate of the vehicle trips the passengers first met on. The figure was made with Gephi using the Fruchterman-Reingold algorithm

Transfer Network

The first objective of this paper is to detect the movement patterns of passenger groups.

To do this, we define a novel network structure denoted as the transfer network. The transfer network is a directed network, where nodes represent the atomic passenger groups of the contact network and groups are connected if at least one member of a group transfers from one vehicle to another. The weight of the edges denote the number of transfer passengers.

In order to build the transfer network we identify the subgraph corresponding to each vehicle trip, detect atomic passenger groups – defined as maximal cliques – on each of the resulting subgraphs of the contact network and connect the atomic passengers groups according to direction of transfer between vehicle trips. We weight the connections based on the number of transferring passengers. Section 3.1 defines the construction of the transfer network in more detail.

The transfer network proposed in this paper provides a refined way to identify the most frequent vehicle trip combinations in the public transit system, which can aid decision makers in defining timetables and optimizing vehicle assignments. Section 3.2 discusses this application.

Transfer Network Construction

The three steps required to build the transfer network are discussed in the subsequent subsections. Section 3.1.1 defines how the subgraphs corresponding to the vehicle trips are constructed, Section 3.1.2 defines atomic passenger groups as cliques and show how they can be detected in an efficient way and Section 3.1.3 shows how the transfer network can be built from the passenger groups.

Graph Partitioning

We define the subgraph corresponding to each vehicle trip as follows. Let G be the original contact network, V (G) the vertices and E(G) the edges of the network. We divide the original network into subgraphs along vehicle trips. Let T be the set of all vehicle trips in the network and let tiT denote the i-th trip. We define \(G_{t_{i}}\) as the subgraph corresponding to trip ti where, \(V(G_{t_{i}})\) and \(E(G_{t_{i}})\) are the vertices and the edges of trip \(G_{t_{i}}\). We create a subgraph \(G_{t_{i}}\) for all trips tiT. Since a single passenger may travel on multiple trips, the node corresponding to the passenger may appear in multiple subgraphs. Figure 3 shows an example of the partitioning process.

Fig. 3

Partitioning of an example graph along vehicle trips. Each vehicle trip has a corresponding subgraph where nodes are passengers who used the given vehicle trip

Clique Detection

In graph theory, a clique is defined as a fully connected subgraph of a given graph. A maximal clique is a clique, that is not a subgraph of any other clique. Finding the set of all maximal cliques is a well-studied NP-hard problem in graph theory. An arbitrary n-vertex graph may have up to 3n/3 maximal cliques, but this number is much lower in many complex networks, including the contact network studied in this paper. Among the available methods in clique detection we adopt the Bron-Kerbosch (BK) algorithm (Bron and Kerbosch 1973), which has proven its reliability in real-life applications (Eppstein et al. 2010). We apply the BK algorithmFootnote 1 to detect the maximal cliques of all subgraphs \(G_{t_{i}}\) corresponding to all vehicle trips tiT.

The analysis can be further extended if we consider the graphs as weighted according to the contact durations available on the edges. Introducing a weight threshold τ we can prune the edges of the graphs by omitting all edges with contact durations below τ. This allows us to redefine what a connection means in the network: passengers are only considered to be connected if they spend at least a certain amount of time on the same vehicle. We define three thresholds to represent the strength of connection between passengers τ5 = 5 minutes, τ15 = 15 minutes and τ30 = 30 minutes. We chose these thresholds to represent short, medium and long duration contacts between passengers. We prune the edges of the contact network by these values resulting in graphs G5, G15 and G30. In additional, let G0 and τ0 represent the original (uncut) graphs and the corresponding threshold. We run the Bron-Kerbosch clique detection algorithm on these graphs and analyze and compare results. The speed of the detection algorithm is amplified by the fact, that all subgraphs corresponding to vehicle trips are interval graphs. In order to show this speedup, we also run the clique detection algorithm on the original contact network and compare results. Table 2 shows graph size, runtime of the clique detection method on the original contact network and the total runtime of the detection method on all subgraphs for G0, G5, G15 and G30. We can see a significant speedup for all graphs in this analysis. For the larger G0 and G5 graphs the total runtime on all subgraphs corresponding to vehicle trips is 14 times less than on the unpartitioned network.

Table 2 Runtimes of the Bron-Kerbosch on graphs G0,G5,G15,G30. Raw denotes runtime in seconds on the original contact network, while partitioned denotes the total runtime on all subgraphs corresponding to vehicle trips

Graph Building

Let F denote the transfer network as a directed graph where every node vV (F) is a clique from \(G_{t_{i}}\) for all ti. Edges connect nodes v and u if the corresponding cliques do not lie in the same vehicle trip, yet they have at least one common passenger. More formally, for nodes u and v in the transfer network and their corresponding cliques cv and cu in the contact network, u and v is connected if cvcu ≥ 1 and if \(c_{v} \subseteq G_{t_{v}}, c_{u} \subseteq G_{t_{u}}\) then \(G_{t_{v}} \neq G_{t_{u}}\).

The direction of edges correspond to the direction of the transfer between tu and tv, that is whether the passengers of cvcu move from tu to tv or the opposite. We establish the direction by looking at the contact start times for the individuals in cvcu in the following way. For all edges of the transfer network euvE(F), let cuv denote the set of corresponding passengers in G as \(c_{uv}=c_{u} \cap c_{v}, c_{v} \subseteq G_{t_{v}}, c_{u} \subseteq G_{t_{u}}\), \(G_{t_{v}} \neq G_{t_{u}}\). Let \(\alpha _{xy}^{t_{i}}\) denote the contact start time between passengers x and y on vehicle trip ti. For all passengers p,qcuv, if \(min(\alpha _{pq}^{t_{u}}) < min(\alpha _{pq}^{t_{v}}) \), then the direction of the edge euvE(F) is from u to v, else it is from v to u. If |cvcu| = 1, then let cvcu = {p0} and pcu ∖{p0},qcv ∖{p0}. Then \(min_{p}(\alpha _{p_{0}p}^{t_{u}}) < min_{q}(\alpha _{p_{0}q}^{t_{v}})\) decides the direction of the edge as above. We assign integer values wv = |cv| to all vV (F) and wu,v = |cvcu| for all e(u,v) ∈ E(F). Values wv and wu,v represent the amount of passengers corresponding to both the nodes and edges of the transfer network.

Detecting Frequent Vehicle Trip Combinations

The transfer network allows us to identify the most frequent vehicle trip combinations passengers travel on in the public transportation system. Detecting the most frequent combinations helps decision makers in defining timetables and optimizing vehicle assignments.

As before, let T be the set of vehicle trips, and tiT the i-th vehicle trip. We define vehicle trip pairs as follows: for all ti and tj, tij is a vehicle trip pair where ij and ti,tjT, while the set of all vehicle trip pairs is denoted by Tp. We assign a number mij to each vehicle trip pair tijTp indicating the amount of passenger traffic between the corresponding trips. To calculate this value let the set \(V(C_{t_{ij}})\) contain all nodes of the transfer network corresponding to all cliques \(c_{t_{ij}} \in C_{t_{ij}}\) where \(c_{t_{ij}} \in G_{t_{i}} \cup G_{t_{j}}\), and take the subgraph induced by \(V(C_{t_{ij}})\). The number mij is the sum of all edge weights of the induced subgraph. This approach offers a more refined view of passenger traffic between vehicle trips because it represents the movements of passenger groups as opposed the behavior of individual passengers.

To give another dimension to our analysis we compared the frequent vehicle trip combinations in both G0,G5,G15 and G30, that is only considering passengers to be in contact with each other if they travel together for more than τ5 = 5 minutes, τ15 = 15 minutes and τ30 = 30 minutes in addition to the unfiltered network G0. Table 3 shows the five most frequent vehicle trip combinations tij and the amount mij of passenger traffic between them.

Table 3 The five most frequent vehicle trip combinations in G0,G5,G15 and G30. The first number indicates the route number followed by the start time of the sepcific vehicle trip

Almost all trip combinations in Table 3 follow a similar pattern. Results for G0 and G5 are nearly identical. The most frequent combinations for G15 connect two additional suburbs (Oakdale and Stillwater) to G0 and G5, but otherwise are identical. The vehicle trips pairs with the greatest amount of passenger traffic between them are mostly in the morning or afternoon peak hours. Almost all of the trip combinations link one of the outlying suburbs to the city center, indicating the daily commuting patterns of workers or students. Just the urban area of Twin Cities covers 2646 km2 with a population of more than three million. This means that commuters trying to reach the city center from one of the outlying suburbs must travel on two or sometimes three separate routes to get to their destination. The pattern – also shown on Fig. 4 – is the following. Commuters start the journey from one of the smaller outlying suburbs like Ridgedale (route 614), Inver Grove and other suburbs south of St. Paul (route 68) or Oakdale and Stillwater (route 294). Then they change vehicles in the transport hub of one of the major outlying suburbs (St. Paul, Minnetonka, Mall of America in Bloomington), and take an express service to the city center (routes 94, 675, 850, etc..).

Fig. 4

Most frequent vehicle trips combinations in Twin Cities, MN for G0,G5,G15 and G30

While the frequent combinations for G30 are similar, there are differences. While the suburbs these routes connect are different, we can see almost the same patterns as before: travel from one of the smaller outlying suburb to a major one and then to the city centre (combinations 71–61 and 5–901). One specific route (850) is one of the longest express bus route in the city and it includes a long section where the bus doesn’t stop at all. One exception to the pattern is route 54, which connects St. Paul International airport to St. Paul and route 68 connecting to south St. Paul. In terms of time in addition to peak hours, we see late night services as well.

We can summarize, that graphs G0,G5 and G15 behave similarly, showing the movements of passenger groups traveling through the public transit system. The most frequent trip combinations are the ones connecting distant suburbs to the city center during the morning and afternoon peak hours. G30 captures an alternative set of routes corresponding to people who travel together for longer periods for different reasons, like a long service (route 850), the scarcity of services late at night (route 5 at 0:30) or traveling from the airport to a major transport hub (route 54).

Community Network

Next we propose a novel network structure, the community network, which expands the definition of passenger connectivity to be a function of both the number of transfers passengers make together as well as the total amount of time they spend together while traveling. This contrasts the contact network, which simply quantifies passenger connectivity passengers using contact duration on individual vehicle trips. We define a community of passengers as a set of passengers who have common travel patterns, e.g., vehicle trips and/or transfers. In order to build communities, we define and quantify a novel connection strength metric between passengers indicating the similarity of their travel patterns and create a new network structure, the community network based on this value. Section 4.1 outlines this process in more detail.

This network can serve as the basis of a community detection algorithm, but in this paper we take a different approach. We define communities as those connected by edges whose values lie above a predetermined threshold. In Section 4.2 we demonstrate this feature by identifying the commuting patterns of the members of the largest passenger community of the network.

We propose an application of the community network in Section 4.3. We show how the connection strength value can be used to model infectious disease transmission among passengers of a public transit system. Tying to our previous work in Bóta et al. (2017a) we seek to identify the vehicle trips most likely to carry infected passengers during an outbreak.

Community Network Construction

In order to detect the communities of the public transportation network we construct a weighted network structure called community network. The community network connects passengers using on a novel link-based metric, the connection strength. The connection strength s defines the edge weights in the community network, and takes into account the number of transfers a pair of passengers makes together. Thus, if the passengers meet on multiple different vehicle trips, their connection strength s will increase. This method is based on the assumption that passengers who not only travel together but also transfer together have a stronger connection than travelers who are simply present on the same vehicle trip at the same time. Below we explain how this link metric is derived.

Let H denote the new network where the nodes are the passengers, and the set of nodes V(H) includes all passengers who traveled on at least two vehicle trips with another passenger. Thus, the nodes in the community network correspond to the edges of the transfer network. We define the connections and weights between passengers as follows. Nodes u and v in the community network are connected if they are both present in at least two different cliques c1,c2 in two different subgraphs corresponding to vehicle trips, i.e. they traveled together on at least two vehicle-trips (\(u,v \in c_{1} \subseteq G_{t_{1}}\) and \(u,v \in c_{2} \subseteq G_{t_{2}}\)). Let guv denote the number of instances where u and v are members of the same clique, that is guv = |Cuv| where cuvCuv if u,vcuv. Let Tuv be the set of vehicle trips where both u and v are present: tuvTuv if u,vtuv, and let \(g_{uv}^{t_{i}}\) be the number of the cliques in vehicle trip ti where u and v are both present: \(g_{uv}^{t_{i}} = |C_{uv}^{t_{i}}|\) where \(c_{uv}^{t_{i}} \in C_{uv}^{t_{i}}\) if \(u,v \in c_{uv}^{t_{i}} \in G_{t_{i}}\). Thus, the connection strength suv between passengers u and v can be formalized as follows:

$$ s_{uv}=\frac{g_{uv}*(g_{uv}-1)}{2}-\sum\limits_{t_{i} \ in \ T_{uv}}\frac{g_{uv}^{t_{i}}*(g_{uv}^{t_{i}} -1)}{2} $$

Using this definition of connection strength, if a pair of passengers traveled together in guv different atomic passenger groups, then they would have \(\frac {g_{uv}*(g_{uv}-1)}{2}\) different edges between them because the affected nodes form a clique in the transfer network. This way the first part of the equation rewards the movements between different vehicle trips. Since based on our community definition traveling on the same vehicle trip doesn’t indicate strong connection between the passengers, in the second part of the equation we penalize any instance where u and v travels together on the same vehicle trip. The value of the penalty will be the sum of the edges in every vehicle trip where the passenger pair appears more than one time, and the number of the edges in a vehicle trip is counted in the same way as in the first part of the equation.


Algorithm 1 shows the construction of the community network, while Fig. 5 illustrates a few examples for computing s between passengers. On Fig. 5a two passengers travel together in two cliques on vehicle trip t1 and one clique on vehicle trip t2, therefore \( g_{uv}^{t_{1}} = 2, \ g_{uv}^{t_{2}} = 1, \ g_{uv}=3\) and s = 2. A different situation is shown on Fig. 5b where two passengers travel together in three cliques on t3 and two cliques on t4 making \(g_{uv}^{t_{3}} = 3, \ g_{uv}^{t_{4}} = 2, \ g_{uv}=5\) and s = 6. Figure 5c presents a trivial case, when two passengers travel together on a single vehicle trip in four cliques making \( g_{uv}^{t_{1}} = 4, \ g_{uv}=4\) and s = 0.

Fig. 5

Examples of the connection strength between pairs of passengers in three different travel scenarios.. Rectangles represent vehicle trips and circles represent cliques. Edges marked with black increase the connection strength between highlighted passengers. Red edges penalize connection strength because these are on the same vehicle trip. a two passengers travel together in two cliques on vehicle trip t1 and one clique on vehicle trip t2, therefore \( g_{uv}^{t_{1}} = 2, \ g_{uv}^{t_{2}} = 1, \ g_{uv}=3\) and s = 2. b two passengers travel together in three cliques on t3 and two cliques on t4 making \(g_{uv}^{t_{3}} = 3, \ g_{uv}^{t_{4}} = 2, \ g_{uv}=5\) and s = 6. c two passengers travel together on a single vehicle trip in four cliques making \( g_{uv}^{t_{1}} = 4, \ g_{uv}=4\) and s = 0

Passenger communities

In this section we illustrate examples of passenger communities in the public transportation system of Twin Cities MN. The first example seen on Fig. 6 shows subgraphs of the community network constructed from G30, i.e. the contact network only containing edges where contact duration is above 30 minutes. Figure 6a depicts the entire community network. Most of the communities on this network are of size two or three, but there are several larger communities with strong connections between the members. Figure 6b shows a subgraph where edges with weights s < 5 are omitted as well as all nodes with degrees below two. The remaining subgraph contains the largest group of the network, while the largest individual community is depicted on Fig. 6c.

Fig. 6

The community network of Twin Cities, MN. a the whole community network, b a subgraph with edge weights greater than 5, c the largest passenger group of the network

Figure 6c shows the largest group in the network. The group contains nine passengers who traveled together on two different vehicle trips, while the overall time they spent using the public transportation network was almost 1.5 hours. The passengers embarked on route 805 in the morning between 7:08 and 7:16 near Blaine and traveled together to Northtown. They disembarked at 7:48 and waited together for the second vehicle 852 arriving at 8:12 and traveled together to downtown Minneapolis for almost an hour until 9:12 and 9:16. The travel path of the community, shown on Fig. 7, indicates a commuting pattern from one of the suburbs to the city center of Minneapolis.

Fig. 7

The travel path of the passenger community on Fig. 6c traveling from a suburb to the city center

Both contact and community networks reveal contact patterns between passengers of a public transportation system. In the contact network, the weight between individual passengers is defined based on the contact duration on individual vehicle trips. In contrast, the community network provides a more refined way to represent connection strength which takes into account the amount of transfers passengers take together.

The underlying concept behind community structure in networks is homophily, which is a well-studied concept of the social sciences (Eagle et al. 2009; Yuan and Gay 2006; Chin et al. 2012). Homophily states that people tend to form groups according to their lines of interest, occupation, etc. Physical proximity – on public transportation for example – is one of these indicators. Therefore, strong connections on the community network can help us uncover connections in other areas of life like workplace or school or other common interests. It should be noted, that physical proximity does not guarantee another type of connection, it simply increases the likelihood of occurrence for it.

Epidemic Spreading Risk Application

One application of identifying the communities within the transit network, as described in this work, is infrastructure security. Understanding passenger communities enables more efficient and accurate tracking of infectious disease spread, were one to be naturally or maliciously introduced into the public transit system. One of the challenges in modeling epidemic spreading is accurately mapping the relationships between individuals traveling on the same vehicle. A traditional contact network as defined in Section 2 as well as in Bóta et al.(2017a, 2017b) and Sun et al. (2013) simply revelas the set of passengers who were present on the same vehicle and the amount of time they spend on the same vehicle. As shown in Bóta et al. (2017b), this limits the options for how infection spreading probabilities can be defined to be simply a function of contact duration, without the possibility to take into account physical proximity, communication etc. between people.

In contrast, the weights of the community network reveals a deeper level of connections between travelers, as passengers with strong connections in this network may also be connected in other areas of life. Passenger groups identified in the community network are more likely to be traveling within close physical proximity of each other and interacting with each other (Eagle et al. 2009; Yuan and Gay 2006; Chin et al. 2012). As a consequence, the probablity of disease transmission between the travelers belonging to the same community is greater than between two passengers simply sharing the same vehicle without any other connection. This is especially true for large public transportation vehicles like trains or trams, where simply being present on the same vehicle may not imply any kind of connection at all.

In this section we expand upon our previous work in Bóta et al. (2017a) which examined epidemic spreading risk in the same public transportation network (Twin Cities, MN) with the goal of identifying the vehicle trips most likely to carry infected passengers. The analysis was presented in two parts. First, the passenger contact network, in the same form as in Section 2, was used to model a variety of outbreak scenarios. The scenarios differed in the number and distribution of initially infected passengers and the level of infectiousness, represented as the risk of spreading between passengers. The spreading risk was defined as the contact duration multiplied by a constant value. The scenarios were compared in terms of their impact on the network and confirmed previous observations in the literature, that is that the most central vehicle trips in the public transportation network are also the most susceptible to infection. The second part of the analysis focused on a newly proposed vehicle trip network, which represents the public transit network as a network of vehicle trips instead of passengers. We showed that centrality metrics on the vehicle trip network provided a more efficient way to detect the set of vehicle trips most susceptible to disease, and this estimation can be done much faster than running simulations on the contact network.

In the rest of this section we present an alternative way to model the risk of disease spreading between passengers, which exploits the community network structure introduced in this work. We define the infection transmission probability between a pair of passengers to be a function of both the connection strength value used in the community network and contact duration. We believe this multidimensional transmission probability more accurately represents the level of connectivity between pairs of passengers. Using these new transmission functions, we implement similar spreading scenarios to the ones in Bóta et al. (2017a) and identify and rank the vehicle trips susceptible to disease spreading. Below we define the new transmission probability function and the infection model, then present the new vehicle-trips identified to be at highest risk.

Experiment Setup

In order to simulate an epidemic outbreak on the contact network we use the well-known discrete compartmental susceptible-infected (SI) model. In the SI infection process all nodes adopt one of two available states: susceptible (S) or infected (I). Real values denoted as edge infection probabilities are assigned to the links of the network and denoted as we ∈ [0,1]. The infection spreads in a network when susceptible nodes adopt an infected state. This is done in discrete time steps in an iterative manner starting from an initially infected set of nodes. In each iteration each infected node tries to infect its susceptible neighbors according to the transmission probability of the link connecting them. If the attempt is successful, the susceptible node is transformed into an infected one in the next iteration. In this work we limit the number of iterations to five, representing a complete work week with recurrent commuting patterns. The inputs of this model are: 1. a contact network of individuals, 2. an assignment of weights to the links of the network which represent the infection transmission probabilities and 3. the set of initially infected nodes, e.g., individuals.

We use the original structure of the contact network, which connects all pairs of passengers that travel on a vehicle-trip together as the basis of this experiment. The link weights are computed in the following way. Since the nodes and links of the community network are a subset of the nodes and links of the contact network, if a link between two nodes is present in the community network we will assign its connection strength value to the corresponding link in the contact network. We do this in all cases where such an assignment can be made. In order to account for extreme outliers of connection strength we cap all values at 100, then rescale all values to the interval of 0.1 and 0.8 using a standard feature scaling method. The 0.8 upper bound is set because a transmission probability of 1.0 is assumed to be unrealistic. For all links of the contact network that do not have a corresponding link in the community network we assign a uniform infection value of 0.05. In contrast to the duration-based value used in Bóta et al. (2017a), this enables us to capture the increased spreading risk between passengers sharing similar travel patterns, while still allowing disease to spread between travelers simply sharing a vehicle. Future work will explore the model’s sensitivity to the uniform infection assignment using in this work.

Following the procedure in Bóta et al. (2017a), we randomly select 100 passengers from the network to be initially infected. Due to the probabilistic nature of the simulation model, we run the SI infection model k = 10000 times to quantify the likelihood of each nod being in an infectious state at the end of the fifth time step.

Vehicle Trip Ranking

The infection model constructed above provides the likelihood of infection for each passenger in the contact network at the end of the simulation, i.e., after five days. As in Bóta et al. (2017a), we compute a similar infection value for vehicle trips by summing the probability of infection for all passengers on a given vehicle trip. While this does not represent a probability value in a strict sense, this value is proportional to the risk of getting infected on a given vehicle trip.

The routes which contain the highest risk vehicle trips, and corresponding travel times identified in this study are presented in Table 4. Figure 8 shows the routes on the map of the city. Coinciding with the findings in Section 3 and also in Bóta et al. (2017a), the vehicle trips are in the morning and afternoon peak hours. Also coinciding with the results in Bóta et al. (2017a) all of the high risk routes identified go through the city center.

Table 4 Ranking of vehicle trip where infection is most likely to appear
Fig. 8

Vehicle trips most likely to carry infected passengers in the public transit system of Twin Cities, MN

Since the goal of Section 3.2 – identifying the most frequent trip combinations – and the goal of this section – identifying the most risky vehicle trips in the case of an epidemic outbreak – are different, the difference between the selected vehicle trips are not surprising. We observe some similarities between the “highest risk” vehicle trips identified here, and the most frequent vehicle trip combinations (identified in Section 3.2). For example, route 18 connects Bloomington with the city center, while route 61 connects St. Paul to the same destination. We have seen in Section 3.2, that these target-destination pairs are present in the most frequent combinations as well. As these routes connect the most important parts of the city, some similarity is expected.

The set of high risk vehicle trips identified here partially correspond to those identified in Bóta et al. (2017a). Specifically, the methodology proposed in Bóta et al. (2017a) identifies vehicle trips on the routes 7, 9, 11, 14, 17, 25, 61 and 94 as most at risk. These are routes crossing or connecting to the city center in the peak hours, so even though they do not completely match those in Table 4, the pattern they present is the same. The similarity in the set and type of routes identified in both studies points to the critical role of the network structure in modeling outbreak risk. Future research will continue to build on this application, and further explore the robustness and sensitivity of the proposed methodology. The relevant sensitivity analysis is, however, outside the scope of this work.


This study is subject to certain limitations. We note, that the transit demand model used as an input of both works was used to generate commuting patterns for single workday. This limitation is most critical for the epidemic application due to the implicit assumption that daily travel patterns remain constant during a five day work week. While this assumption is supported by a recent study (Sun et al. 2013), more long term observations potentially involving weekends and public holidays would improve the quality of the results presented in this paper.

Further, while we present an alternative metric to quantify disease transmission risk between passengers (compared to contact duration alone), without any real-world observations on an outbreak validating the results of this work remains a challenge, and will be the focus of future research. One solution would be to use epidemic data recorded in another major city (Sun et al. 2014), but adapting such data sources to different environment presents its own set of challenges.


In order to better understand passenger commuting habits in a public transportation system, in this paper we analyzed the contact patterns from a public transit assignment model in a major metropolitan city. We did this by defining two novel network structures to detect and quantify the movement patterns of passengers. The transfer network tracks the movements of atomic passenger group while the community network links passengers with similar travel patterns. We presented applications for both networks, specifically to identify the most frequently used travel paths, i.e., routes and transfers, and model epidemic risk posed by passengers of a public transit network, respectively.

Our main findings correspond to existing literature, while providing a more detailed analysis of passenger commuting habits. We have shown that the most frequent vehicle trip combinations follow a similar pattern. The trip pairs identified using the transfer network identified peak morning and afternoon trips that connect the outlying suburbs with the CBD. The transfer network was demonstrated as a tool to efficiently detect the most frequently used vehicle trip combinations in the public transportation system.

Our results also reinforce previous observations in revealing components of the transit system at risk from an epidemic outbreak. For this purpose, we have shown how the connection strength value can be used to estimate physical contact between passengers and to identify the vehicle trips most likely to carry infected passengers: the routes crossing or connecting to the city center in the peak hours.

In the future we plan to extend the epidemic application and more thoroughly investigate how connection strength can be used to improve risk analysis in a public transportation setting.


  1. 1.

    We implemented the Bron-Kerbosch algorithm with pivoting and degeneracy ordering in the outer most level of recursion in C++. We ran the algorithm on a PC with I7 4790 CPU (3.6 Ghz) and 16GB of RAM.


  1. Aicher C, Jacobs AZ, Clauset A (2014) Learning latent block structure in weighted networks. J Complex Networks 3(2):221–248.

    Article  Google Scholar 

  2. Balcan D (2009) Multiscale mobility networks and the spatial spreading of infectious diseases. Proc Natl Acad Sci USA 106:21,484–21,489.

    Article  Google Scholar 

  3. Bao J, Xu C, Liu P, Wang W (2017) Exploring bikesharing travel patterns and trip purposes using smart card data and online point of interests. Netw Spat Econ 17(4):1231–1253.

    Article  Google Scholar 

  4. Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech: Theory Exp 2008(10):P10,008.

    Article  Google Scholar 

  5. Bóta A, Gardner L, Khani A (2017a) Identifying critical components of a public transit system for outbreak control. Netw Spat Econ 17(4):1137–1159.

  6. Bóta A, Gardner L, Khani A (2017b) Modeling the spread of infection in public transit networks: a decision-support tool for outbreak planning and control. In: Transportation research board 96th annual meeting

  7. Bóta A, Kovács L (2014) The community structure of word association graphs. In: Proceedings of the 9th international conference on applied informatics, pp 113–120.

  8. Bóta A, Krész M (2015) A high resolution clique-based overlapping community detection algorithm for small-world networks. Informatica 39(2):177–187

  9. Brockmann D, Hufnagel L, Geisel T (2006) The scaling laws of human travel. Nature 439:462–465.

    Article  Google Scholar 

  10. Bron C, Kerbosch J (1973) Algorithm 457: Finding all cliques of an undirected graph. Commun ACM 16(9):575–577.

    Article  Google Scholar 

  11. Cambridge Systematics Inc. (2015) 2010 travel behavior inventory: model estimation and validation report prepared for metropolitan council

  12. Carlsson-Kanyama A, Lindén AL (1999) Travel patterns and environmental effects now and in the future: implications of differences in energy consumption among socio-economic groups. Ecol Econ 30(3):405–417.

    Article  Google Scholar 

  13. Chen N, Gardner L, Rey D (2016) A bi-level optimization model for the development of real-time strategies to minimize epidemic spreading risk in air traffic networks. Transportation Research Record: Journal of the Transportation Research Board 2569:62–69.

    Article  Google Scholar 

  14. Chin A, Xu B, Yin F, Wang X, Wang W, Fan X, Hong D, Wang Y (2012) Using proximity and homophily to connect conference attendees in a mobile social network. In: 2012 32nd international conference on distributed computing systems workshops. IEEE, pp 79–87

  15. Eagle N, Pentland AS, Lazer D (2009) Inferring friendship network structure by using mobile phone data. Proc Natl Acad Sci 106(36):15,274–15,278.

    Article  Google Scholar 

  16. Eppstein D, Löffler M, Strash D (2010) Listing all maximal cliques in sparse graphs in near-optimal time. In: International symposium on algorithms and computation. Springer, pp 403–414

  17. Fortunato S (2010) Community detection in graphs. Phys Rep 486(3-5):75–174.

    Article  Google Scholar 

  18. Funk S, Salathé M, Jansen VAA (2010) Modelling the influence of human behaviour on the spread of infectious diseases: a review. J R Soc Interface 7:1247–1256.

    Article  Google Scholar 

  19. Gardner L, Fajardo D, Waller S (2012) Inferring infection-spreading links in an air traffic network. Transportation Research Record: Journal of the Transportation Research Board 2300(1):13–21.

    Article  Google Scholar 

  20. Háznagy A, Fi I, London A, Németh T (2015) Complex network analysis of public transportation networks: a comprehensive study. In: 2015 international conference on models and technologies for intelligent transportation systems (MT-ITS). IEEE, pp 371–378

  21. Hoogendoorn S, Bovy P (2005) Pedestrian travel behavior modeling. Netw Spat Econ 5(2):193–216.

    Article  Google Scholar 

  22. Huerta R, Tsimring LS (2002) Contact tracing and epidemics control in social networks. Phys Rev E Stat Nonlin Soft Matter Phys 66 (056):115.

    Article  Google Scholar 

  23. Illenberger J, Nagel K, Flötteröd G (2012) The role of spatial interaction in social networks. Netw Spat Econ 13(3):1–28.

    Article  Google Scholar 

  24. Khani A (2013) Models and solution algorithms for transit and intermodal passenger assignment (development of fast-trips model). PhD Dissertation, University of Arizona, Tucson AZ, USA

  25. Khani A, Beduhn T, Duthie J, Boyles S, Jafari E (2014) A transit route choice model for application in dynamic transit assignment. Innovations in Travel Modeling. Baltimore, MD

  26. Khani A, Hickman M, Noh H (2015) Trip-based path algorithms using the transit network hierarchy. Netw Spat Econ 15(3):635–653.

    Article  Google Scholar 

  27. Krzakala F, Moore C, Mossel E, Neeman J, Sly A, Zdeborová L, Zhang P (2013) Spectral redemption in clustering sparse networks. Proc Natl Acad Sci 110(52):20,935–20,940.

    Article  Google Scholar 

  28. Lancichinetti A, Fortunato S, Kertész J (2009) Detecting the overlapping and hierarchical community structure in complex networks. New J Phys 11(3):033,015.

    Article  Google Scholar 

  29. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World Wide Web. ACM, pp 631–640, DOI, (to appear in print)

  30. Manuel C, Roberto C, Michael F (2006) A frequency-based assignment model for congested transit networks with strict capacity constraints: characterization and computation of equilibria. Transp Res B Methodol 40(6):437–459.

    Article  Google Scholar 

  31. de Montjoye YA, Hidalgo CA, Verleysen M, Blondel VD (2013) Unique in the crowd: the privacy bounds of human mobility. Sci Rep 3:1376.

    Article  Google Scholar 

  32. Nassir N, Khani A, Hickman M, Noh H (2012) An intermodal optimal multi-destination tour algorithm with dynamic travel times. Transportation Research Record: Journal of the Transportation Research Board 2283:57–66.

    Article  Google Scholar 

  33. Newman ME (2013) Spectral methods for community detection and graph partitioning. Phys Rev E 88(4):042,822.

    Article  Google Scholar 

  34. Newman ME, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69(2):026,113.

    Article  Google Scholar 

  35. Palla G, Derényi I., Farkas I, Vicsek T (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043):814–818.

    Article  Google Scholar 

  36. Peixoto TP (2014) Efficient monte carlo and greedy heuristic for the inference of stochastic block models. Phys Rev E 89(1):012,804.

    Article  Google Scholar 

  37. Pendyala R, Kondhuri K, Chiu YC, Hickman M, Noh H, Waddell P, Wang L, You D, Gardner B (2012) Integrated land use-transport model system with dynamic time-dependent activity-travel microsimulation. Transportation Research Record: Journal of the Transportation Research Board 2203:19–27.

    Article  Google Scholar 

  38. Ramadurai G, Ukkusuri S (2010) Dynamic user equilibrium model for combined activity-travel choices using activity-travel supernetwork representation. Netw Spat Econ 10(2):273–292.

    Article  Google Scholar 

  39. Rey D, Gardner L, Waller ST (2016) Finding outbreak trees in networks with limited information. Netw Spat Econ 16(2):687–721.

    Article  Google Scholar 

  40. Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proc Natl Acad Sci 105 (4):1118–1123.

    Article  Google Scholar 

  41. Saberi M, Rashidi TH, Ghasri M, Ewe K (2018) A complex network methodology for travel demand model evaluation and validation. Netw Spat Econ, pp 1–23.

  42. Salathé M (2010) A high-resolution human contact network for infectious disease transmission. Proc Natl Acad Sci USA 107:22,020–22,025.

    Article  Google Scholar 

  43. Song C, Qu Z, Blumm N, Barabási AL (2010) Limits of predictability in human mobility. Science 327:1018–1021.

    Article  Google Scholar 

  44. Sun L, Axhausen KW, Lee DH, Cebrian M (2014) Efficient detection of contagious outbreaks in massive metropolitan encounter networks. Sci Rep 4:5099.

    Article  Google Scholar 

  45. Sun L, Axhausen KW, Lee DH, Huang X (2013) Understanding metropolitan patterns of daily encounters. Proceedings of the National Academy of Sciences 110(34):13,774–13,779.

    Article  Google Scholar 

  46. Wang R, Nakamura F, Okamura T, Warita H (2011) How the risk-taking personality influences commute drivers’ departure time choices. Procedia Soc Behav Sci 16(Supplement C):814–819.

    Article  Google Scholar 

  47. Wen-Tai L, Ching-Fu C (2011) Behavioral intentions of public transit passengers-the roles of service quality, perceived value, satisfaction and involvement. Transp Policy 18(2):318–325.

    Article  Google Scholar 

  48. Wu ZH, Lin YF, Gregory S, Wan HY, Tian SF (2012) Balanced multi-label propagation for overlapping community detection in social networks. J Comput Sci Technol 27(3):468–479.

    Article  Google Scholar 

  49. Xie J, Kelley S, Szymanski BK (2013) Overlapping community detection in networks: the state-of-the-art and comparative study. ACM Comput Surv 45(4):43.

    Article  Google Scholar 

  50. Yuan YC, Gay G (2006) Homophily of network ties and bonding and bridging social capital in computer-mediated distributed teams. J Comput-Mediat Commun 11(4):1062–1084.

    Article  Google Scholar 

Download references


The authors are grateful to Metropolitan Council of Twin Cities for sharing the activity-based travel demand model with researchers at the University of Minnesota. Any limitations of this study remains the sole responsibility of the authors. László Hajdu acknowledges the National Research, Development and Innovation Office (NKFIH) for funding the project ”Graph Optimisation and Big Data” (Grant No: SNN-117879) and the support of the EU-funded Hungarian grant EFOP-3.6.3-VEKOP-16-2017-00002. Miklós Krész acknowledges the European Commission for funding the InnoRenew CoE project (Grant Agreement #739574) under the Horizon2020 Widespread-Teaming program and the support of the EU-funded Hungarian grant EFOP-3.6.2-16-2017-00015. András Bóta acknowledges the Olle Engkvist Byggmästare Foundation.


Open access funding provided by Umea University.

Author information



Corresponding author

Correspondence to András Bóta.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Hajdu, L., Bóta, A., Krész, M. et al. Discovering the Hidden Community Structure of Public Transportation Networks. Netw Spat Econ 20, 209–231 (2020).

Download citation


  • Network modeling
  • Public transportation
  • Community structure
  • Infrastructure security