A causal inference approach to measure the vulnerability of urban metro systems

Transit operators need vulnerability measures to understand the level of service degradation under disruptions. This paper contributes to the literature with a novel causal inference approach for estimating station-level vulnerability in metro systems. The empirical analysis is based on large-scale data on historical incidents and population-level passenger demand, and thus obviates the need for assumptions made by previous studies on human behaviour and disruption scenarios. We develop four empirical vulnerability metrics based on the causal impact of disruptions on travel demand, average travel speed and passenger flow distribution. Specifically, the proposed metrics based on the irregularity in passenger flow distribution extend the scope of vulnerability measurement to the entire trip distribution, instead of analysing the disruption impact on entry or exit demand alone (that is, moments of the trip distribution). Unbiased estimates of the disruption impact are obtained by adopting a propensity score matching method, which adjusts for the confounding biases caused by the non-random occurrence of disruptions. An application of the proposed framework to the London Underground indicates that the vulnerability of a metro station depends on its location, topology and other characteristics. We find that, in 2013, central London stations were more vulnerable in terms of travel demand loss. However, the loss of average travel speed and the irregularity in relative passenger flows reveal that passengers from outer London stations suffer longer individual delays due to a lack of alternative routes.


Introduction
Metros, also known as subways or rapid transit, have become a vital component of public transport. With the advantage of large capacity and high-frequency services, 178 metro systems worldwide carried a total of 53,768 million trips in 2017 (International Union of Public Transport, 2018). Incidents occur frequently in urban metro systems, mainly due to supply-side failures (e.g., signal failures), sudden increases in travel demand (e.g., public concerts or football matches) and changes in weather conditions (Brazil et al., 2017; Melo et al., 2011; Wan et al., 2015). These incidents can cause service delays and overcrowding, which in turn lead to safety concerns and potential losses in social welfare. For instance, the London Underground encountered 7973 service-disrupting incidents of more than 2 minutes' duration between April 2016 and April 2017, causing a total loss of around 34 million customer hours (Transport for London, 2016-2017; Transport for London, 2019). The Singapore Mass Rapid Transit experienced 47 severe delays that lasted over 30 minutes between 2015 and 2017 (Land Transport Authority, 2017).
Operators may consider investing in new technologies to improve metro facilities and mitigate the effect of incidents. For instance, the New York City Subway was in a state of emergency in June 2017 after a series of derailments, track fires and overcrowding incidents. The Metropolitan Transportation Authority invested over $8 billion to stabilise and modernise the incident-plagued metro system (Metropolitan Transportation Authority, 2019). It is apparent that metros are willing to invest in their infrastructure systems, but it is often not known how those investments compare in achieving improvements. To facilitate project selection, metros are increasingly relying on disaggregate performance metrics that reveal the most vulnerable parts of the network. Performance can be measured in various ways; popular examples are risk, resilience, reliability and vulnerability related metrics. These concepts are often confused by researchers as well as practitioners. Interested readers can refer to Faturechi and Miller-Hooks (2015) for the most widely agreed relationships among these concepts. In this paper, we focus on the vulnerability of urban metro systems, where the service outputs of interest are passenger demand and average travel speed.
Since the 1990s, the concept of vulnerability has been widely used to characterise the performance of transport systems (Mattsson and Jenelius, 2015). Vulnerability is often defined as a measure of the susceptibility of a transport system to incidents (Berdica, 2002; Jenelius et al., 2006). Vulnerability metrics may measure the consequences of disturbances in the form of service outputs such as train kilometres, passenger volumes or the quality of travelling. Such metrics have important implications for identifying weak stations or links in metro systems and for efficiently allocating resources to the most affected areas. Vulnerability metrics can also be used to inform passengers about the general level of inconvenience at disrupted stations, which can help them adjust their travel plans to avoid potential delays or crowding. Given the rising interest in utilising vulnerability metrics in disruption prevention and management, obtaining a correct measure of such metrics is crucial.
Traditionally, vulnerability in urban metros is investigated based on complex network theory and graph theory. Complex network theory converts metro networks into graphs, which enables the quantitative measurement of vulnerability in metro systems (Chopra et al., 2016; Derrible and Kennedy, 2010; Yang et al., 2015). The adoption of graph theory has facilitated the evolution of vulnerability indicators from simply capturing the characteristics of network topology to also considering travel demand patterns and their land use dependencies (Jiang et al., 2018). However, most of these studies rely on simulation-based approaches to quantify vulnerability under hypothetical disruption scenarios. These simulation experiments rest on assumptions, both about passenger behaviour and about the type and scale of disruptions (Lu, 2018; Sun and Guan, 2016; Sun et al., 2015; Sun et al., 2018). With an empirical approach, such assumptions can be avoided, and more reliable vulnerability metrics can be obtained from historical evidence.
Empirical approaches are rare in the literature. The only exception we are aware of is Sun et al. (2016), who first detect incidents based on abnormal ridership and then use these real incident data to assess the vulnerability of the metro system. However, their method has some limitations. First, they assume the occurrence of incidents to be random, which is a strict and unrealistic assumption, as we demonstrate in this study. Second, abnormal ridership may not be a good indicator of incidents if fluctuations in ridership are merely manifestations of changes in travel demand.
This paper proposes a novel alternative methodology to quantify vulnerability by empirically estimating the causal impact of service disruptions on travel demand and average travel speed at the station level. The application of a propensity score matching method accounts for the non-randomness of disruptions and ensures unbiasedness of the causal estimates. We make this approach comprehensive for the entire network, including stations where incidents are not observed, by predicting the level of vulnerability at these stations with a random forest algorithm. In this way, we eliminate the need for ad hoc assumptions on passenger behaviour and the nature of disruptions.
We use the London Underground as a case study and apply the methodology to large-scale automated ticketing and incident data. We find that, among the stations most affected by incidents, the gross travel speed of trips initiated at Canary Wharf, Victoria and Liverpool Street decreases by 156%, 121% and 115%, respectively, relative to regular conditions. In terms of passenger demand, the highest reductions in station inflows and outflows are measured at Oxford Circus, Waterloo and London Bridge, which are in this sense the most vulnerable stations of the London Underground. In more isolated parts of the network, where alternative routes may not be available, stations may lose up to 98% of their demand due to incidents. These results have the potential to aid the investment decisions of the metro operator.
The rest of the paper is organised as follows. Section 2 reviews the literature on vulnerability measurement and disruption impact analysis in urban metro systems. Section 3 presents our empirical framework to compute vulnerability metrics. This section discusses the proposed causal inference approach to estimate the unbiased disruption impact, which is the key input in building vulnerability metrics. In Section 4, we analyse the vulnerability of the London Underground as a case study. Results are discussed in Section 5. Finally, Section 6 concludes and highlights potential avenues for future research.

Literature review
Below we provide a contextual review of previous studies related to vulnerability measurement. In Section 2.1, we review the literature on vulnerability quantification in rail transit networks, while Section 2.2 examines previous attempts to estimate the impact of disruptions.

Measuring the vulnerability of metro systems
There are two traditional methods used to build vulnerability indicators of metro systems: topology-based and system-performance-based analysis.
The topological methods rely on complex network theory to convert the metro network into a scale-free graph, in which nodes represent metro stations, edges represent links between directly connected stations, and the weight associated with each edge is computed based on travel time or distance (Derrible and Kennedy, 2010; Mattsson and Jenelius, 2015; Zhang et al., 2011). Changes in the system's connectivity are reflected on graphs by removing nodes or links, and vulnerability is entirely governed by the topological structure. For instance, the locational importance of metro stations or links is indicated by the number of edges connected to a specific node and the fraction of shortest paths passing through the given node/edge (Sun and Guan, 2016; Sun et al., 2018; Yang et al., 2015; Zhang et al., 2018). Network-level efficiency is indicated by the average of the reciprocal shortest path lengths between all origin-destination (OD) pairs. Such global indicators capture the overall reachability as well as the service size of a metro system (Sun et al., 2015; Yang et al., 2015). System-performance-based analyses not only consider the network topology but also incorporate real data on metro operations (e.g., ridership distribution) into vulnerability measurement (M'Cleod et al., 2017; Mattsson and Jenelius, 2015). For instance, Sun et al. (2018) use a ridership-based indicator, the sum of flows on edges connected to a given node, to complement the topological measures by integrating passengers' travel preferences. Other studies use passenger delay and demand loss as vulnerability indicators (Adjetey-Bahun et al., 2016; M'Cleod et al., 2017; Rodríguez-Núñez and García-Palomares, 2014). Specifically, passenger delay is summarised by changes in the weighted average of travel time between all OD pairs due to disruptions, where the weights are station-level passenger loads. Jiang et al. (2018) suggest integrating land use characteristics around stations into vulnerability measurement, because metro systems interact with the external environment during incidents.
To quantify vulnerability based on the aforementioned indicators of system performance, almost all previous studies adopt simulation-based approaches and assume hypothetical disruption scenarios. The simplest disruption scenario involves a single station or link closure, assuming one node or edge in the graph is out of service. This incident affects the topological structure and passengers' route choices, and the differences in the corresponding performance indicators under normal and disrupted scenarios are quantified to measure vulnerability (Sun et al., 2015). More complex disruption scenarios include the closure of two or more non-adjacent stations, the failure of an entire line, and the sequential closure of stations until the network crashes (Adjetey-Bahun et al., 2016; Chopra et al., 2016; Sun and Guan, 2016; Zhang et al., 2018). Ye and Kim (2019) also discuss the case of partial station closure.
Simulation-based studies gained popularity because they do not require incident data and can flexibly control simulation settings to imitate a wide range of possible situations. However, researchers have to make many assumptions to infer passengers' responses to virtual disruptions. Without observing passengers' movements during real incidents, the validity of these simulation assumptions is questionable. For example, while quantifying passenger delay indicators, Rodríguez-Núñez and García-Palomares (2014) and Adjetey-Bahun et al. (2016) assume that all passengers travel at the same speed and do not change their destinations under disruptions unless no route is available. In reality, however, passengers can travel at different speeds, leave the metro system, change their destinations, or reroute during disruptions. As a result, especially for system-performance-based analyses, vulnerability metrics obtained from simulation-based studies may not reflect the true changes in the level of service due to disruptions. There is, therefore, scope to improve vulnerability measurement by empirically estimating the impact of disruptions.

Estimating disruption impact
In an urban rail transit context, early attempts to analyse disruption impact relied on surveys. Rubin et al. (2005) conducted a stated preference survey to understand the psychological and behavioural reactions of travellers to the bombing incident that happened in London in July 2005. They consider passengers' reduced intention to travel by the London Underground after the attack as the key indicator. Since stated willingness may not reflect real travel behaviour, Zhu et al. (2017) performed a revealed preference survey to investigate travellers' reactions to transit service disruptions in the Washington D.C. Metro. By comparing actual travel choices before and during the metro shutdown, they find a 20% reduction in demand. Results from such surveys are usually presented as the percentage change in passengers' preferences for travel modes, departure times, and destinations. Although this information is useful, detailed information about delays or demand losses is still needed to quantify the true disruption impacts. Furthermore, there are inherent limitations of survey-based studies. For instance, repeated observations of a respondent are difficult to collect over a long period because of constraints associated with cost, manpower, recording accuracy, and privacy protection of respondents (Kusakabe and Asakura, 2014). A survey sample also cannot cover all passengers, which may lead to biased estimates of disruption impact if the sample is not representative of the population.
With the wide use of automated fare collection facilities in metro systems, smart card data have become a powerful tool for research related to transit operations and travel behaviour (Pelletier, Trépanier and Morency, 2011). Compared to survey data, the key advantages of smart card data are cost-effectiveness, continuous long-term recording and accurate travel information for each passenger within the system (Kusakabe and Asakura, 2014). Therefore, researchers have started using smart card data to analyse disruption impacts. For instance, Sun et al. (2016) develop a method to identify incidents and conduct trip assignments with and without incidents. They estimate the disruption impact by computing the differences between the two assignments in terms of ridership distribution and travel time across all OD pairs. This study does not require extra assumptions about passengers' reactions because their actual locations and movements are revealed by the smart card data. However, the authors assume that metro disruptions occur randomly, while in reality, factors such as travel demand, signalling type, passenger behaviour, operating years, rolling stock characteristics and weather conditions have a significant influence on the likelihood of metro failures (Brazil et al., 2017; Melo et al., 2011; Wan et al., 2015). This is a particularly important consideration because the impact estimated from a direct comparison of performance indicators before and after disruptions will be biased under non-random occurrence of disruptions. Specifically, some factors affecting the impact of disruptions (e.g., passenger behaviour and weather conditions) may also affect the occurrence of disruptions, leading to confounding bias in pre-post comparison estimates (Imbens and Rubin, 2015). Some researchers also adopt prediction-based approaches to quantify disruption impact using smart card data. For instance, Silva et al. (2015) propose a framework to predict exit ridership and model the behaviour of passengers under station closures and line segment closures. In a very recent study, Yap and Cats (2020) apply supervised learning approaches to predict the passenger delay caused by incidents. However, these prediction-based studies also cannot disentangle the causal effect of disruptions and can yield biased estimates due to the existence of confounding factors.
Table 1 shows a comparison of recent vulnerability studies and also illustrates the contribution of this research. We conclude this section with a summary of the gaps in the literature that we address to obtain more accurate measures of vulnerability:
1. Previous studies on vulnerability metrics of transit systems are largely based on simulation approaches. These studies do not account for the actual behaviour of passengers under disruptions. Basing analyses on empirical data, rather than simulations, obviates the need for potentially unrealistic assumptions about passengers' movements.
2. In urban metro systems, disruption occurrences can be non-random.Therefore, empirical studies on quantifying disruption impacts should account for this non-randomness to eliminate confounding biases in estimation.
In this paper, we show that both improvements can be made by adopting causal inference methods and calibrating them with large-scale smart card data and incident data. Specifically, the proposed method allows for the non-random occurrence of disruptions and adjusts for potential bias caused by confounding factors. Subsequently, the unbiased empirical estimates of disruption impact are used to accurately compute vulnerability metrics of metro systems.

Methodology
From a methodological point of view, our empirical approach has three stages. First, we apply a causal inference method to estimate the impact of disruptions on station-level travel demand and travel speed (see Section 3.1). Then, in Section 3.2, we construct vulnerability metrics based on the disruption impacts estimated in the first stage. Finally, the third stage imputes missing vulnerability metrics for non-disrupted stations using machine learning algorithms. Figure 1 illustrates all steps of the proposed empirical framework.

Stage 1: Causal inference method to estimate disruption impact
To evaluate the impact of a disruption on a metro system, we use Rubin's potential outcome framework to establish causality (Rubin, 1974). We define metro disruptions as 'treatments', and the objective of our analysis is to quantify the causal effect of treatments on 'outcomes' related to system performance. Specifically, we are interested in estimating causal effects on travel speed and ridership. From the literature, we know that factors such as passenger demand, weather conditions, network topology and engineering design influence the likelihood of disruption occurrence (Brazil et al., 2017; Melo et al., 2011; Wan et al., 2015). Therefore, the assignment of the treatment is not random. This is important in our context because the factors associated with the assignment of the treatment are also likely to affect the outcomes of interest, and are thus potential confounders in the estimation of impacts. Since previous studies on disruption impact have ignored the non-randomness of treatments, their estimated impacts may be biased.
We adopt propensity score matching (PSM) methods to address this issue, which can eliminate such confounding biases. The propensity score is defined as the conditional probability that a unit receives treatment given its baseline confounding characteristics. If the observed characteristics sufficiently capture the sources of confounding, treatment assignment and outcomes are conditionally independent given the propensity score, which can then be used to consistently estimate impacts (Imbens and Rubin, 2015). This index is obtained by estimating the relationship between treatment assignment and baseline confounding characteristics using a regression model. The estimated propensity score is then used to form various semi-parametric estimators of the treatment effect, such as weighting, regression, and matching estimators. In this section, we first provide a contextual formulation of PSM and then describe how we apply PSM to quantify the causal impact of metro disruptions on the performance of metro systems.

Propensity Score Matching (PSM) Methods
The system-level impact, which averages the impact of all disruptions that occurred within the metro system, is too generic to represent network vulnerability. Thus, we focus instead on estimating station-level disruption impacts. We define a study unit as the observation of a metro station $i$ within a 15-minute interval $t$. The treatment variable, denoted by $D_{it} \in \{0, 1\}$, records whether study unit $i$ at time $t$ is observed in a disrupted ($D_{it} = 1$) or undisrupted ($D_{it} = 0$) state. To quantify disruption impacts, we define the outcomes of interest as the change in travel demand and in the average speed of trips that start from the given study unit, denoted by $Y_{it}$. The unit-level causal effect of a disruption is the difference between the two potential outcomes:

$$\tau_{it} = Y_{it}(1) - Y_{it}(0), \quad i \in \{1, \dots, N\}, \; t \in \{1, \dots, T\}, \tag{1}$$

where $N$ is the total number of stations within the metro system, and $T$ is the total number of time intervals during the study period (for example, $T = 4$ if the study period is 1 hour). $Y_{it}(0)$ and $Y_{it}(1)$ are counterfactual potential outcomes, only one of which is observed. The propensity score, denoted by $e(X_{it})$, is obtained by regressing $D_{it}$ on confounding factors, denoted by $X_{it}$. We discuss potential confounding factors in the empirical study in Section 4.
To derive valid causal inference using PSM, our model must satisfy three key assumptions. The first is the conditional independence assumption (CIA), which states that conditional on the observed confounding factors $X_{it}$, the treatment assignment should be independent of the potential outcomes. The advantage of the propensity score stems from the property that this conditional independence can be achieved by conditioning on a scalar rather than on high-dimensional baseline covariates (Rosenbaum and Rubin, 1983). Thus, the CIA based on the propensity score can be written as:

$$\left( Y_{it}(0), Y_{it}(1) \right) \perp D_{it} \mid e(X_{it}).$$

The second assumption requires common support in the covariate distributions by treatment status:

$$0 < \Pr(D_{it} = 1 \mid X_{it}) < 1,$$

which states that the conditional distribution of $X_{it}$ given $D_{it} = 1$ should overlap with the conditional distribution of $X_{it}$ given $D_{it} = 0$. This assumption can be tested by comparing the distributions of propensity scores between the treatment and control groups.
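The overlap check can be illustrated with a minimal sketch; the function `common_support_range` and the toy propensity scores below are our own, not from the paper:

```python
import numpy as np

def common_support_range(ps_treated, ps_control):
    """Interval on which the treated and control propensity-score
    distributions overlap; matches outside it are discarded."""
    lo = max(float(ps_treated.min()), float(ps_control.min()))
    hi = min(float(ps_treated.max()), float(ps_control.max()))
    return lo, hi

# toy propensity scores: common support holds on [0.3, 0.7]
lo, hi = common_support_range(np.array([0.3, 0.5, 0.8]),
                              np.array([0.1, 0.4, 0.7]))
```

In practice the comparison is usually made graphically, by plotting the two propensity-score histograms on a shared axis.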
The third assumption, also known as the stable unit treatment value assumption (SUTVA), requires that the outcome for each unit should be independent of the treatment status of other units (Graham et al., 2014).
If all three assumptions hold, the average treatment effect (ATE) of disruptions on a station $i$ can be derived as (Imbens and Wooldridge, 2009):

$$\hat{\tau}_i = \frac{1}{T_i} \sum_{t=1}^{T_i} \left( Y_{it} - \frac{1}{M} \sum_{j \in \mathcal{J}_M(i,t)} Y_j \right), \tag{2}$$

where $t \in \{1, \dots, T_i\}$ denotes the disrupted time intervals of station $i$ during the study period (i.e., $D_{it} = 1$), and $\mathcal{J}_M(i,t)$ is the set of indices of the $M$ closest control units (in terms of propensity scores) for station $i$ disrupted at time $t$. Thus, $\hat{\tau}_i$ represents the average difference between the outcomes of the treated and matched control units.
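To make the nearest-neighbour form of the estimator in Equation (2) concrete, consider the following illustrative sketch. The paper's analysis was carried out in R; the function name `station_ate` and the toy numbers here are ours:

```python
import numpy as np

def station_ate(y_treated, ps_treated, y_control, ps_control, m):
    """Average treatment effect for one station: for each disrupted interval,
    average the outcomes of the m control units with the closest propensity
    scores, then take the mean of the treated-minus-matched differences."""
    diffs = []
    for y, e in zip(y_treated, ps_treated):
        nearest = np.argsort(np.abs(ps_control - e))[:m]  # m closest controls
        diffs.append(y - y_control[nearest].mean())
    return float(np.mean(diffs))

# toy values: entry ridership in disrupted vs comparable normal intervals
y_t = np.array([60.0, 55.0])          # outcomes of two disrupted intervals
e_t = np.array([0.30, 0.35])          # their propensity scores
y_c = np.array([100.0, 95.0, 40.0])   # outcomes of candidate control units
e_c = np.array([0.31, 0.36, 0.90])    # control propensity scores
ate = station_ate(y_t, e_t, y_c, e_c, m=2)  # about -40 passengers per interval
```

The third control unit is never matched because its propensity score (0.90) is far from those of the treated intervals, which is exactly the behaviour the common support assumption relies on.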
In the next subsection, we explain how the causal inference framework introduced in Equations (1) and (2) can be implemented in the present application. The sequence of steps we follow is shown in Figure 1. We first provide details of the propensity score model, followed by a description of our matching algorithms and the estimation of disruption impact.

Application of PSM Methods
To predict the propensity score, i.e., the probability of encountering a disruption at a metro station within a 15-minute interval conditional on the baseline confounding characteristics, we use a logistic regression model with a linear link function:

$$\log \left( \frac{e(X_{it})}{1 - e(X_{it})} \right) = \alpha + \beta^{\top} X_{it}, \tag{3}$$

where $\alpha$ is the intercept and $\beta$ is the vector of regression coefficients related to the vector of confounding factors $X_{it}$. In our empirical study, a station with a higher number of incidents in the past is more likely to encounter a new disruption in the future, analogous to accident black spots on highways. To account for this temporal correlation among disruption occurrences, we ensure that the confounding factors contain the history of past disruptions that happened on the same day.
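As a sketch of this step: the paper estimates the model in R, whereas below we simulate hypothetical confounders and fit the logistic model of Equation (3) by Newton-Raphson, so all variable names and coefficient values are invented for illustration:

```python
import numpy as np

def fit_propensity(X, d, iterations=25):
    """Fit log(e/(1-e)) = alpha + beta'X by Newton-Raphson and return the
    estimated propensity score e(X) for every study unit."""
    Xd = np.column_stack([np.ones(len(X)), X])  # add intercept column
    beta = np.zeros(Xd.shape[1])
    for _ in range(iterations):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        w = p * (1.0 - p)                       # logistic weights
        beta += np.linalg.solve((Xd * w[:, None]).T @ Xd, Xd.T @ (d - p))
    return 1.0 / (1.0 + np.exp(-Xd @ beta))

# simulated study units with hypothetical confounders:
# 15-min entry demand, rain indicator, disruptions earlier on the same day
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([rng.normal(500, 150, n),
                     rng.integers(0, 2, n).astype(float),
                     rng.poisson(0.5, n).astype(float)])
# non-random treatment: busier, rainier, incident-prone intervals fail more
true_logit = -4.0 + 0.004 * X[:, 0] + 0.8 * X[:, 1] + 0.6 * X[:, 2]
d = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit))).astype(float)
e_hat = fit_propensity(X, d)  # estimated propensity scores in (0, 1)
```

Including the count of disruptions earlier on the same day as a regressor is what encodes the temporal correlation discussed above.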
Additionally, we consider a more advanced generalised additive model (GAM), in which the logarithm of the odds ratio is modelled via semi-parametric smoothing splines. A GAM has the potential to uncover flexible relationships between the likelihood of disruption occurrence and the confounding factors. The GAM with temporal correlation is presented in Equation (4):

$$\log \left( \frac{e(X_{it})}{1 - e(X_{it})} \right) = \alpha + f(X_{it}; \theta), \tag{4}$$

where $f(X_{it}; \theta)$ is a flexible spline function of the baseline characteristics. After estimating the propensity scores, we check the common support (overlap) assumption to ensure effective matching and the reliability of the propensity score estimates (Lechner, 2001).
The next step is matching. Every treated unit $i$ at time $t$ is paired with $M$ similar control units using the value of their propensity scores. Since there is no theoretical consensus on the superiority of any matching algorithm, we adopt two commonly used approaches: subclassification matching and nearest neighbour matching. We compare them under different replacement conditions and pairing ratios, and select the specification that best balances the means of the confounding factors across groups. It is also necessary to check the conditional independence assumption after matching. We conduct balancing tests to check whether the disrupted units and the matched units are statistically similar across the domain of confounders. If significant differences are found, we try another specification of the propensity score model and repeat the above procedure.
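One widely used balancing diagnostic is the standardised mean difference (SMD) of each confounder between disrupted and matched control units, with |SMD| below roughly 0.1 taken as a common rule of thumb. A minimal sketch with made-up demand figures:

```python
import numpy as np

def standardised_mean_difference(x_treated, x_control):
    """Balance diagnostic for one confounder: difference in group means
    scaled by the pooled standard deviation."""
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

# before matching: disrupted intervals are much busier than raw controls
before = standardised_mean_difference(np.array([520.0, 540.0, 560.0, 580.0]),
                                      np.array([400.0, 420.0, 440.0, 460.0]))
# after matching on the propensity score, the groups should look alike
after = standardised_mean_difference(np.array([520.0, 540.0, 560.0, 580.0]),
                                     np.array([515.0, 545.0, 555.0, 585.0]))
```

If `after` stayed above the threshold for some confounder, the propensity score model would be re-specified and the matching repeated, as described above.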
In the last step, we estimate the station-level disruption impact using Equation (2). Given the matched pairs, the treatment effect for a station in a specific period is estimated as the difference between the outcomes of the treated unit and its matched control units. The average station-level disruption impact is then obtained by averaging these differences across disrupted periods. We separately estimate the average treatment effect for the following three measures of metro performance:
1. Entry ridership: the number of passengers who enter the study unit.
2. Exit ridership: the number of passengers who exit the study unit.
3. Average travel speed: the average speed of all trips that start from the study unit. This speed is computed as the total travel distance divided by the total journey time, where travel distances (track lengths) are recovered using shortest path assignment. If the origin station is entirely closed, no passenger can continue their trip and the average speed is zero. If the origin station is partially closed, the average speed reflects the trips of passengers who remain in the system and keep travelling on alternative lines.
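The third measure can be sketched as follows; this is a toy illustration of the definition above, not the authors' code, with trips represented as hypothetical (distance, time) pairs:

```python
def average_travel_speed(trips):
    """Average speed of trips starting at a station within one interval:
    total shortest-path track distance divided by total journey time.
    An entirely closed station has no trips, so the speed is zero."""
    total_km = sum(distance for distance, _ in trips)
    total_h = sum(time for _, time in trips)
    return total_km / total_h if total_h > 0 else 0.0

# (distance_km, time_h) pairs for trips leaving one station in a 15-min window
normal = average_travel_speed([(6.0, 0.2), (9.0, 0.3)])  # 15 km / 0.5 h
closed = average_travel_speed([])                        # full closure
```

Note that dividing totals, rather than averaging per-trip speeds, weights faster long trips and slower short trips by their actual contribution to passenger kilometres.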

Stage 2: Constructing vulnerability metrics
We propose three station-level vulnerability metrics that are constructed from the empirical estimates of disruption impacts.

i). The loss of travel demand is expressed as:

$$\Delta Q_i = \Delta q_i^{\text{exit}} + \Delta q_i^{\text{entry}},$$

where $\Delta q_i^{\text{exit}}$ and $\Delta q_i^{\text{entry}}$ denote the change in the number of passenger exits and entries, respectively. $\Delta Q_i$ is the net change in station-level demand during a 15-minute interval.
ii). The loss of average travel speed quantifies the decline in the level of service experienced by each passenger at a metro station (individual delay), which is expressed as:

$$\Delta S_i = \Delta v_i,$$

where $\Delta v_i$ denotes the decrease in the average travel speed of each trip starting from station $i$ during a 15-minute disruption period. By definition, $\Delta S_i$ accounts for the changes in both the travel distance and the journey time of passengers.
iii). The loss of gross travel speed reflects the loss of passenger kilometres per unit time for the entire station (overall delay), which is expressed as:

$$\Delta G_i = \Delta v_i \times \bar{q}_i,$$

where $\bar{q}_i$ denotes the average entry ridership over all disrupted 15-minute intervals at the corresponding station. Thus, $\Delta G_i$ denotes the total decrease in average travel speed for all passengers who start their journeys from station $i$ during a 15-minute service disruption.
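The three metrics follow mechanically from the estimated impacts. A sketch with hypothetical estimates (the function and figures are ours, for illustration only):

```python
def vulnerability_metrics(d_exit, d_entry, d_speed, avg_entry_ridership):
    """Station-level vulnerability metrics per disrupted 15-minute interval:
    - loss of travel demand: change in exits plus change in entries
    - loss of average travel speed: drop in per-trip speed (individual delay)
    - loss of gross travel speed: per-trip drop scaled by entry ridership."""
    demand_loss = d_exit + d_entry
    individual_delay = d_speed
    gross_delay = d_speed * avg_entry_ridership
    return demand_loss, individual_delay, gross_delay

# hypothetical station: 30 fewer exits, 50 fewer entries, trips 4 km/h
# slower, and 200 entries in an average disrupted interval
demand, individual, gross = vulnerability_metrics(-30, -50, -4.0, 200)
```

The split between the second and third metrics matters for interpretation: a lightly used station can impose large individual delays while contributing little to the network-wide loss of passenger kilometres.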

Stage 3: Imputing Missing Vulnerability Metrics
Some stations may not encounter any incidents within the study period. Thus, the empirical disruption impact and the vulnerability metrics cannot be estimated directly for these stations. To predict the missing metrics of non-disrupted stations, we estimate a random forest regression model (Hastie et al., 2009):

$$\hat{f}(X_i) = \frac{1}{B} \sum_{b=1}^{B} g(X_i; \Theta_b),$$

where $\hat{f}(X_i)$ denotes the random forest predictor, $B$ is the number of trees, and $X_i$ is a vector of input features (see Table 2 for details). Furthermore, $g(X_i; \Theta_b)$ is the output of the $b$-th random forest tree, and $\Theta_b$ characterises the $b$-th tree. The random forest regression that we apply here combines a bagging algorithm with ensemble learning techniques. By averaging the output of several trees (or weak learners, in boosting terminology), it reduces overfitting. Interested readers can refer to Hastie et al. (2009) for details of random forest regression algorithms and the reasons behind their superior prediction accuracy. We also benchmark the prediction performance of random forest regression against competing methods such as linear regression and support vector machines.
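The bagging average at the heart of the predictor can be illustrated with stub trees. In a real application each tree would be fitted to bootstrap samples of the disrupted stations' features and metrics; the single-split stub predictors below are stand-ins we invented for illustration:

```python
import numpy as np

def make_stub_tree(rng):
    """Stand-in for a fitted regression tree g(x; theta_b): one random split
    with a constant prediction in each leaf."""
    split = rng.uniform(0.0, 1.0)
    left, right = rng.normal(10.0, 1.0), rng.normal(20.0, 1.0)
    return lambda x: left if x < split else right

rng = np.random.default_rng(42)
trees = [make_stub_tree(rng) for _ in range(500)]  # B = 500 trees

def forest_predict(x):
    # the forest prediction is the average of the B individual tree outputs
    return float(np.mean([g(x) for g in trees]))

low = forest_predict(0.0)   # always falls in the left leaves, close to 10
high = forest_predict(1.0)  # always falls in the right leaves, close to 20
```

Averaging over many noisy trees is what stabilises the prediction: each individual leaf value has unit standard deviation here, while the forest average varies far less.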

Case study: London Underground
In 2013, the London Underground (LU) had 270 stations and 11 lines, with a total length of 402 km stretching deep into Greater London. With its circle-radial network structure, shown in Figure 2 (Wikimedia Commons, 2013), LU is one of the largest and most complex metro systems in the world. Of all lines within the network, one is circular (the Circle Line), covering Central London, and the remaining 10 are radial routes converging at the centre of the system. In terms of connectivity among stations, LU has 56 stations connecting 2 lines, 16 stations connecting 3 lines and 8 stations connecting more than 4 lines. LU is also one of the busiest metro systems, with 1.265 billion journeys by the end of 2013 (Transport for London, 2019). Owing to over 150 years of operations and enormous passenger demand, disruptions occur frequently in LU. We use the following data to analyse the station-level vulnerability of the LU system. Data processing and analysis were conducted using open-source R software (version 3.6.2).
Pseudonymised smart card data: Transport for London (TfL) provided automated fare collection data from 28/10/2013 to 13/12/2013 (35 weekdays) between 6:00 and 24:00. We consider this duration as our study period. The smart card data contain information on the transaction date and time, entry and exit locations, encrypted card ID and ticket type (pay as you go/season ticket). Time stamps are recorded at one-minute resolution. Using the smart card data, we compute the entry/exit ridership of each station and obtain passengers' journey times and travel speeds.

Incidents and service disruption information:
TfL also provided incident information data for our study period. By mining the provided incident logs, we construct an accurate database of service disruptions, which includes the occurrence time, location and duration of each disruption.

LU network topology information:
We collect data on station coordinates, topology structure and the length of tracks between adjacent stations from open databases authorised by TfL.
Weather data: We collect temperature (°C), wind speed (km/h) and rain status from the Weather Underground web portal. Based on the observations of over 1000 weather stations around London, we estimate weather conditions for all LU stations at 15-minute resolution for our study period.
LU station characteristics: These station-level features include daily ridership, station age, subsurface/deep-tube stations, terminal stations and screen doors. We also calculate supplementary factors, which capture the characteristics of the affected areas around metro stations. To compute these factors, we define the affected area as a circular area with a radius of 500 metres around the station. We use 2011 UK Census data at the Lower Super Output Area (LSOA) level 3 to calculate these supplementary factors. We select all LSOAs whose centroids are within the 500-metre radius of the affected area. We then average the related statistics of the selected LSOAs according to their areas in the circle. Figure 3 illustrates this calculation process.

To construct the causal inference framework for LU, our study unit is the observation of a metro station during each 15-minute interval within the system service time. We define a metro disruption as the state in which scheduled train services are interrupted for at least 10 minutes at a station. Over the study period, LU encountered 2894 disruptions lasting from 10 minutes to 11 hours. The aim of causal inference is to estimate the unbiased impact of these observed disruptions (i.e., the treatment) on system-performance measures (the outcome). The treatment status is constructed according to the disruption database described in Section 4.
To match the disruption duration with the timeframe of study units, we define the following rule to assign the treatment status: if a disruption occurs within a 15-minute interval t at a given station i, we regard this study unit as disrupted (i.e., treated), no matter whether the disruption starts or ends in the middle of the interval or lasts for the entire 15-minute interval. Conversely, if the station is under normal service during the entire 15-minute interval, we regard this study unit as un-disrupted (i.e., untreated). The treatment outcomes are presented as three station-level performance indicators: entry ridership, exit ridership and average travel speed.
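The assignment rule above can be sketched as follows (the data structures are hypothetical; the paper derives treatment status from the TfL disruption database):

```python
# Treatment assignment: a station's 15-minute study unit is treated (1) if any
# logged disruption at that station overlaps the interval, regardless of
# whether the disruption starts inside it, ends inside it, or spans it.
def treatment_status(interval_start, disruptions, width=15):
    """Return 1 if any (start, end) disruption, in minutes, overlaps
    the half-open interval [interval_start, interval_start + width)."""
    interval_end = interval_start + width
    for start, end in disruptions:
        if start < interval_end and end > interval_start:  # standard overlap test
            return 1
    return 0
```

For example, a disruption from minute 100 to minute 160 marks every interval it touches as treated, including the interval in which it starts and the one in which it ends.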
As discussed earlier, metro disruptions may not occur randomly. We list all potential confounding factors for LU in Table 2, which we use in estimating the propensity score model (Section 3.1). These confounders are selected according to the literature and domain expertise, and include travel demand, weather conditions, engineering design, time of day and past disruptions (Brazil et al., 2017; Melo et al., 2011; Wan et al., 2015). Table 2 also shows the covariates available for the imputation of missing vulnerability metrics in Stage 3 (Section 3.3), which include not only some of the confounders but also the supplementary factors of LU station characteristics.

Results
Out of the 270 stations of the LU system, TfL provided the required datasets for 265 stations during the study period (28/10/2013 to 13/12/2013). Smart card data were missing for the remaining five stations. Our analysis only covers weekdays, during which the system is open for 18 hours per day, from 6:00 a.m. to midnight. Based on the assumption of exchangeability of weekdays (Silva et al., 2015), we generate a panel dataset with a total of 265 × 35 × 18 × 60/15 = 667,800 study units. Although the PSM method is data-hungry, the untreated pool (control group) is large enough to ensure adequate matches for treated units. Specifically, the ratio of the number of control to treatment units is around 15:1.
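The quoted panel size can be verified directly from the study dimensions:

```python
# Panel dimensions: 265 stations x 35 weekdays x 18 service hours,
# with each hour split into four 15-minute study intervals.
stations, weekdays, hours, mins_per_interval = 265, 35, 18, 15
intervals_per_day = hours * 60 // mins_per_interval   # 72 intervals per day
study_units = stations * weekdays * intervals_per_day  # 667,800 study units
```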

Propensity score models
We initially include three key baseline covariates (past disruptions, time of day and real-time travel demand) in the logistic regression. We then iteratively add one of the remaining covariates listed in Table 2 at a time, and conduct a likelihood ratio test to decide whether the additional covariate should be included in the final specification. Generalised additive models (GAM) were also tested, but we do not observe any gains in model fit. A high proportion of dummy variables (12 out of 18) may limit the gains from a flexible spline specification of the link function. The estimation results of the logistic regression model are summarised in Table 3.
The role of the propensity score model is to establish a comprehensive index that represents all confounding factors, rather than to predict treatment assignment. While noting that the logistic regression model does not reveal the causal effect of covariates on the likelihood of incident occurrence, we briefly discuss the multivariate correlations uncovered by this model. The coefficients of the time dummies indicate that incidents are more likely to occur in morning peak hours. Positive signs on the coefficients of the remaining confounders (except the Rail dummy) confirm that all these factors increase the probability of encountering a disruption. Specifically, origin or terminal stations are more likely to face vehicle dispatching problems and depot-related incidents. Surface stations are more susceptible to the surrounding environment than deep-tube stations. The accumulated number of past disruptions on the same day increases the probability of encountering another incident. Taken together, the propensity score model reveals that the occurrence of metro disruptions is non-random, which, in turn, justifies the application of causal inference methods in estimating disruption impacts. Alternatively, the estimated propensity score model can be viewed as a binary classifier that predicts whether metro disruptions occur. To illustrate its diagnostic ability, we compute the area under the receiver operating characteristic curve: AUC = 0.795, which again indicates that the occurrence of metro disruptions is non-random.
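As a minimal sketch of this stage: the propensity score is the fitted probability from a logistic regression of the disruption indicator on confounders. The toy fitting routine below uses plain gradient descent on synthetic data; the paper's model is fitted by maximum likelihood on the Table 2 confounders.

```python
# Toy propensity-score estimation: logistic regression of a binary treatment
# on confounders, fitted by batch gradient descent; the fitted probability
# for each study unit is its propensity score.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Return coefficients (intercept first) minimising logistic loss."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w

def propensity_scores(X, w):
    """Fitted treatment probabilities, i.e. the propensity scores."""
    return [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) for xi in X]
```

With one binary confounder where treated units are more common at x = 1, the fitted scores recover the group-wise treatment frequencies.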

Matching results
Before the estimated propensity scores are utilised for matching, we inspect the common support condition (Assumption 2 of the PSM method). Figure 4 presents the propensity score distributions for both disrupted and normal observations. The histograms display apparent overlap between the treatment and control groups, even for large propensity scores. No treated unit lies outside the range of common support, which means we do not need to discard any observations. We thus conclude that the overlap assumption is tenable in our empirical study.
The PSM method aims to balance the distribution of confounders between the treatment and control groups after the matching stage. To assess the quality of matching, we perform balance tests for four algorithms: subclassification matching, nearest neighbour matching without replacement (M = 1), nearest neighbour matching with replacement (M = 1) and nearest neighbour matching with replacement (M = 2), where M is the number of matched control units for each treatment unit. We find that nearest neighbour matching with replacement (M = 1) performs best, improving the overall balance of all confounding factors by 99.97%. This improvement indicates that, within matched pairs, the difference in propensity scores between treatment and control units is reduced by 99.97% compared with the original data before matching.
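Nearest neighbour matching with replacement (M = 1) can be sketched as below; the propensity scores and outcomes are toy numbers, and the helper names are hypothetical:

```python
# 1-nearest-neighbour propensity-score matching with replacement: each treated
# unit is paired with the control whose propensity score is closest (controls
# may be reused); the effect on the treated is the mean outcome difference.
def match_with_replacement(treated, controls):
    """treated/controls: lists of (propensity, outcome). Returns matched pairs."""
    pairs = []
    for p_t, y_t in treated:
        p_c, y_c = min(controls, key=lambda c: abs(c[0] - p_t))
        pairs.append(((p_t, y_t), (p_c, y_c)))
    return pairs

def att(pairs):
    """Average treatment effect on the treated over matched pairs."""
    return sum(y_t - y_c for (_, y_t), (_, y_c) in pairs) / len(pairs)
```

Allowing replacement means each treated unit always gets the closest available control, which is what drives the balance improvement reported above.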

Imputation of missing vulnerability metrics
During the study period, 21 out of 265 stations did not encounter any service disruptions. We apply the random forest regression model to predict the missing vulnerability metrics of these stations. The input features of the model are indicated in the Stage 3 column of Table 2. Table 4 compares the prediction performance of random forest regression, linear regression and support vector machines.
We consider four measures to benchmark the performance of random forest regression against the other methods: mean absolute error (MAE), root mean squared error (RMSE), relative absolute error (RAE) and relative squared error (RSE). Whereas MAE measures the average magnitude of the prediction errors, RMSE represents the standard deviation of the unexplained variance (Willmott and Matsuura, 2005). A better prediction model produces lower values of these performance measures. The results in Table 4 indicate that the random forest regression outperforms the competing methods, with the lowest MAE, RMSE, RAE and RSE for all vulnerability metrics.
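Under their standard definitions, the four measures can be computed as follows; RAE and RSE normalise the errors by those of a naive predictor that always outputs the mean of the observed values.

```python
# Standard prediction-error measures: MAE, RMSE, and the relative errors
# RAE and RSE (errors divided by those of a predict-the-mean baseline).
import math

def prediction_errors(y_true, y_pred):
    n = len(y_true)
    mean_y = sum(y_true) / n
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    sq_err = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    mae = sum(abs_err) / n
    rmse = math.sqrt(sum(sq_err) / n)
    rae = sum(abs_err) / sum(abs(t - mean_y) for t in y_true)
    rse = sum(sq_err) / sum((t - mean_y) ** 2 for t in y_true)
    return mae, rmse, rae, rse
```

A model that merely predicts a constant close to the mean yields RAE and RSE near 1; values well below 1 indicate genuine predictive power.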

LU vulnerability metrics
The estimated vulnerability metrics vary across stations in the LU system. For the 265 operating stations in 2013, during a 15-minute period of service disruption, the loss of station demand ranges from 0 to 743 passengers, the loss of average travel speed ranges from 0 to 4.78 kilometres/hour, and the loss of gross travel speed ranges from 0 to 1790 passenger-kilometres/hour. The spatial distribution of the three vulnerability metrics is visualised in Figure 5. For demand loss and gross speed loss, the most vulnerable stations are in central London areas, while a small number of vulnerable stations are also located in suburban areas. Conversely, for the loss of average travel speed, the most vulnerable stations are scattered around outer London areas. These stations usually have only one metro line and very limited access to other transport modes compared with Central London areas. When passengers encounter disruptions, they need to wait longer in the system until train services are recovered in order to continue their trips. In other words, due to the lack of alternative routes, passengers at these stations tend to experience longer individual delays.
We sort all 265 stations based on demand and speed loss, and present the vulnerability metrics of the top 15 stations in Table 5. Six stations (Canary Wharf, Oxford Circus, Victoria, London Bridge, Bond Street, Green Park) rank among the top fifteen vulnerable stations based on both the demand loss (first) and gross speed loss (third) metrics. Oxford Circus, London Bridge and Canary Wharf are the most vulnerable stations in the LU network. Under disruptions, they not only encounter significant demand loss but also lose a huge passenger distance per unit time. However, if we look purely at the metric of average travel speed, the most vulnerable stations are South Kenton, Latimer Road and Chesham in outer London areas, where each passenger suffers the longest delay due to the lack of alternative routes. The above rankings based on different vulnerability metrics can assist metro operators in preparing effective plans for ridership evacuation and service recovery.

Table 5 also presents normalised vulnerability metrics for these top 15 stations, which are the relative changes compared with the undisrupted performance (baseline). Within a 15-minute interval, the baseline for the station demand metric is the sum of the average entry and exit ridership. The baseline for average speed loss is the average speed of trips starting from the station, and the baseline for gross speed loss is the average trip speed weighted by the average entry ridership. Note that all baselines for the three metrics are calculated using undisrupted observations. We find that the rankings based on relative vulnerability metrics can differ from the rankings based on absolute metrics, especially for the loss of travel demand. In more isolated parts of the network, where alternative routes may not be available, stations can lose up to 98% of their demand due to incidents (e.g., Warwick Avenue Station in Zone 2, with only one connecting line). This implies that more connected stations are actually less vulnerable in this respect, as passengers can find alternative routes if one of the lines becomes disrupted. This also highlights potentially important distinctions in the interpretation of the proposed metrics. In terms of the relative metrics of travel speed, average trip speeds at the top three vulnerable stations (South Kenton, Latimer Road and Chesham) decrease by 30.6%, 26.6% and 15.5%, respectively. Canary Wharf, Victoria and Liverpool Street are the top three vulnerable stations based on the loss of gross trip speed, with speed reductions of 156%, 121% and 115%, respectively. These stations are also among the top five for both speed metrics in the rankings based on absolute values.
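The normalisation described above reduces to dividing each absolute loss by its undisrupted baseline; the numbers below are hypothetical:

```python
# Relative vulnerability metric: absolute loss divided by the undisrupted
# baseline (e.g. for demand loss, the baseline is the average entry plus exit
# ridership in a 15-minute interval under normal service).
def relative_metric(absolute_loss, baseline):
    """Relative loss as a share of the undisrupted baseline."""
    return absolute_loss / baseline

# e.g. a station losing 49 of a baseline 50 passengers per interval loses 98%
demand_share_lost = relative_metric(49, 50)
```

The same division applies to the two speed metrics, with the baselines defined above; only the choice of baseline differs across metrics.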

Conclusions and Future Work
Incidents occur frequently in urban metro systems, causing delays, crowding and substantial losses of social welfare. Operators need accurate estimates of vulnerability measures to identify the bottlenecks in the network. We propose a novel causal inference framework to estimate station-level vulnerability metrics in urban metro systems and empirically validate it for the London Underground system. In contrast to previous simulation-based studies, which largely assume virtual incident scenarios and necessitate unrealistic assumptions on passenger behaviour, our approach relies on real incident data and avoids behavioural assumptions by leveraging automated fare collection (smart card) data. We also illustrate that incidents can occur non-randomly, which further justifies the importance of the proposed causal inference framework in obtaining unbiased estimates of disruption impacts.
The proposed empirical framework consists of three stages. First, we conduct propensity score matching and estimate unbiased disruption impacts at the station level. The estimated impacts are subsequently used to establish vulnerability metrics. In the last stage, for non-disrupted stations, we impute the vulnerability metrics using the random forest regression model. We propose three empirical vulnerability metrics at the station level: loss of travel demand, loss of average travel speed and loss of gross travel speed. The demand loss metric reflects the number of passengers who, before or after entering the disrupted metro system, i) switched to other transport modes, ii) switched their departure time, trip origin or destination, or iii) abandoned their trip. In other words, it indicates the demand for alternative transport services during disruptions, which can guide metro operators in preparing effective service replacement plans. The two speed-related metrics reflect the degradation in the level of service for passengers who still use the metro system under disruptions. These metrics provide essential information for service recovery to mitigate the adverse influence on passengers and the overall performance of stations.
The results of the case study of the London Underground in 2013 indicate that the effect of service disruption is heterogeneous across metro stations and depends on the location of a station in the network and other station-level characteristics. In terms of travel demand loss and gross speed loss (overall delay), the most affected stations are more likely to be found in central London areas, such as Oxford Circus, London Bridge and Canary Wharf. On the other hand, considering average speed loss (individual delay), the most affected stations are scattered around outer London areas (e.g., South Kenton and Chesham) due to the lack of alternative routes. For metro operators, the demand metric is particularly useful when preparing service replacement plans, while the two speed loss metrics can help mitigate the adverse disruption impact on passengers.
This study addresses a research gap in measuring metro vulnerability from an empirical perspective. The empirically estimated vulnerability metrics are practical because they reveal the actual impact of real incidents, rather than of virtually simulated disruptions. The proposed methodology to obtain unbiased estimates of disruption impact thus provides crucial information to metro operators for disruption management. It helps in identifying the bottlenecks in the network and in preparing targeted plans to evacuate ridership and recover services in case of incidents. The direct integration of the estimated vulnerability metrics in preparing these targeted plans remains an avenue for future research.
It is worth noting that the proposed framework can be applied to other metro systems, conditional on the availability of the required data on incident logs and confounding characteristics, among others. Future empirical studies can also incorporate other context-specific and relevant confounders in their analysis. For example, they can include interchange ridership as a confounder in the propensity score estimation. We do not include this covariate in our LU case study because it cannot be directly derived from smart card data; rather, an advanced assignment algorithm is required to identify passengers' routes by matching smart card data with vehicle location data.
Moreover, disruption impact estimates are probabilistic relative to the sample data; that is, the causal estimates and vulnerability metric estimates have sampling distributions. Since our analysis is based on LU data from October 28 to December 13, 2013, the results of our case study reflect the vulnerability status of LU for this specific period. If we used data from other periods, the estimates of the vulnerability metrics might change due to inherent temporal variations in travel demand and incidents. Therefore, to improve the generalisability of the vulnerability metric estimates, the study period needs to be long enough that the sample is representative of the population. That is, a sample should capture supply-side interruptions as much as possible, including service disruptions due to maintenance. In addition, the sample should also reflect possible fluctuations in travel demand.
Future research has three potential directions. First, stations surrounding a disrupted station may also be indirectly affected, but this study does not account for such spill-over effects. Modelling the spatiotemporal propagation of disruption impacts requires significant methodological developments, which would be an important improvement over the current method. Second, the proposed vulnerability metrics can reveal static disruption impacts at different stations, but passengers need real-time service information to reschedule their trips. Thus, the current framework can be extended to update the vulnerability metrics dynamically. Considering how information provision influences passengers' decisions under disruptions, this advancement would improve the dissemination of incident alerts to passengers in real time. Finally, by merging data from other travel modes (e.g., bus) with metro datasets, we can estimate multi-modal vulnerability metrics in the same causal inference framework and understand the characteristics of the mode shift due to disruptions.

Figure 3 :
Figure 3: The illustration of calculating station-level supplementary factors.

Figure 4 :
Figure 4: Histogram of propensity scores to test the common support condition.
Figure 5 :
Figure 5: Spatial distribution of station-level vulnerability metrics in London Underground: (a) the loss of travel demand, (b) the loss of average travel speed, (c) the loss of gross travel speed.

Table 1 :
A comparison of recent research on metro vulnerability.

Table 2 :
Available covariates for PSM and vulnerability imputation.
3 Source: London Datastore, published by Greater London Authority: https://data.london.gov.uk/census/.
Figure 3 panels: (a) station affected areas; (b) an example of LSOA data. * Computed for the affected area around each station.

Table 3 :
The results of propensity score model (logistic regression).

Table 4 :
Prediction accuracy of different regression methods.

Table 5 :
Top 15 vulnerable stations based on empirical vulnerability metrics.