Validation of a multi-modal transit route choice model using smartcard data

Validation of travel demand models, although recognised as important, is seldom undertaken. This study adds to the scarce literature in this field by undertaking an external validation of a multi-modal transit route choice model. The model was estimated using smart card data for the urban transit network of Amsterdam before the introduction of a new metro line and is used to predict changes in travel behaviour after the network change. To validate, the model was checked for changes in estimated parameters between the two time periods, and predictive ability was evaluated at different aggregation levels. Although most model parameters were found to be unstable between the two contexts, the predictive performance at all levels was similar to the locally estimated model. Moreover, individual choices and transit mode-share predictions were found to be close to the observed ones. The errors were relatively larger for the link and route-level predictions, some of which could be attributed to the assumptions made regarding the consideration choice set given as input to the model. On comparing alternative model specifications, using generic instead of mode-specific travel attributes led to a strong degradation in predictive performance. Conversely, a model incorporating overlap between routes, with a better model fit in the base period, did not offer a clear improvement in prediction performance. The study highlights the need to validate transit route choice models before using them for deriving policy recommendations, especially in this data-rich age in which it can often be undertaken at a relatively low additional cost.


Introduction
The last few decades have seen substantial research into discrete choice models of transit route choice (Bovy and Hoogendoorn-Lanser 2005; Guo and Wilson 2011; Liu et al. 2010). These models aid in understanding transit riders' preferences by revealing the relative valuation of various travel attributes, often specifically focusing on service quality characteristics such as those related to transfers (Garcia-Martinez et al. 2018; Guo and Wilson 2011; Nielsen et al. 2021), crowding (Hörcher et al. 2017; Kim et al. 2015; Yap et al. 2020) or reliability (Swierstra et al. 2017). The relative valuations obtained from these models can be used for predicting passenger flows in response to changes in policy, enabling the comparison of alternative policy scenarios. When the selected model is close to the true representation of reality, the estimated parameters are expected to be stable for a reasonable range of temporal and spatial conditions, and the model forecasts are expected to resemble the observed demand. However, the process of model validation is only seldom undertaken, and model selection is typically made based on goodness-of-fit statistics such as log-likelihood and rho-squared (Parady et al. 2021). Although such statistics are useful in their own right, models with high goodness-of-fit may not necessarily be well-specified and hence may not be transferable (Koppelman and Wilmot 1982). Issues like overfitting, endogeneity, omission of variables, measurement errors, incorrect model structure or incorrect theoretical assumptions about the travel behaviour could lead to a misspecified model, which may still have an acceptable goodness-of-fit statistic.
Model validation can be defined as "the evaluation of generalizability of a statistical model" (Parady et al. 2021) and includes both internal validation or reproducibility and external validation or transferability. External validation can be further divided into spatial transferability, temporal transferability, and methodological transferability (Parady et al. 2021), with the latter referring to the model performance on data collected using different methodologies. Although recognised as important, external validation is rarely undertaken in the case of travel demand models, probably due to the lack of suitable data. Parady et al. (2021) highlight that only 4% of transport academic literature published between 2014 and 2018 conducted an external validation.
In recent years, revealed preference data in general, and smart card data in particular, has become increasingly available for inferring route choices of transit travellers (see for example Hörcher et al. 2017; Jánošíkova et al. 2014; Kim et al. 2019; Yap et al. 2020). Depending on the penetration rate amongst transit riders, smart card data can provide information on almost all journeys made in the network at a highly disaggregated level. However, no information is available on the intention of the travellers, their origin location, and in many cases the time of arrival at the origin stop. Due to these limitations, several assumptions need to be made along the modelling process, specifically regarding the travellers' consideration choice set and the perceived level of service values. However, to the best of our knowledge, none of the studies that elicit route choice preferences from smart card data has attempted to validate the performance of the resulting models. This study aims to address this gap in the literature by undertaking an external validation of a transit route choice model using smart card data, thereby providing valuable insights on how transferable such models are and how we can facilitate their transferability.
A model of transit mode-route choice was developed for the urban transit network of Amsterdam, where a new North-South metro line was added to the existing bus, tram, and metro network in July 2018. Along with the addition of the new line, significant changes were made to the rest of the network (see Brands et al. (2020) for details). This major network change provides an opportunity to perform an ex-post evaluation of the route choice model developed based on data before the network change. Two types of validation tests are undertaken. First, we compare the model parameters estimated for the transferred ('before') model with the locally estimated model developed based on the data 'after' the network change. The two data sets used are approximately 3 months apart. Second, the demand changes estimated using the transferred model are compared with the observed demand after the network change. The results aim to establish the validity of route choice models estimated using smart card data for predicting the change in travel behaviour because of a major network change.
The rest of the paper is structured as follows: we start with reviewing the literature on "External validation of travel demand models". Section "Methods" describes the study setting and the various statistical tests used for model validation, along with the model specifications. Section "Results and Discussion" presents the results of the validation tests undertaken on our data, and Section "Conclusion" discusses the main conclusions.

External validation of travel demand models
External validation of models, or transferability, implies the ability of a model developed in one context to be useful in another context. Transferability is implicitly assumed when models are used to predict change in demand in response to a policy change. Some of the earliest literature on (external) model validation dates back to Atherton and Ben-Akiva (1976) and Train (1978). Since then, most work in this area has focused on the temporal transferability of models over long time horizons (often more than 10 years) and/or their spatial transferability. The primary motivation for such studies was to reduce costs of data collection and model development by using an existing model for a comparable region or during a different time period for the same region. Parady et al. (2021) provide a comprehensive review of the recent literature on the validation of discrete choice models in transportation. Here we narrow our focus to external validation studies, and discuss the main issues and corresponding learnings from these studies.
A fundamental theoretical assumption behind any model transferability is the consistency of underlying behavioural theory in both contexts. Koppelman and Wilmot (1982) highlight that model transferability is a "property of the estimation and application contexts, as well as the specification of the model". Naturally, a model with highly context-specific variables will not be transferable to a new context. Sometimes the Alternative Specific Constants (ASCs) are updated based on the application context to account for average changes in unobserved variables between the two contexts (Atherton and Ben-Akiva 1976; Badoe and Miller 1995; Sanko and Morikawa 2010). While the updated ASCs capture the mean contribution of the unobserved terms, there could also be differences in the variance of these unobserved terms. Hence, before transferring a model, the scale for the transferred model needs to be updated to match the scaling differences between the two contexts (Swait and Louviere 1993). In cases where the estimation and application contexts are widely different, one could implement a partial model transfer with varying transfer scales for different sub-groups of variables (Gunn et al. 1985). This is especially applicable when some parameters are more transferable than others. For example, Fox et al. (2014) found the level of service parameters to be more transferable than cost parameters in their study of mode-destination choice models.
Multiple studies have noted that, generally, an improved model specification improves transferability (Badoe and Miller 1995; Fox et al. 2014; Rossi and Bhat 2014). However, some others also highlight the risk of overfitting, which may reduce the transferability of models. For example, Fox (2015) found that although incorporating taste heterogeneity in time and cost parameters improved model fit for their base data, it did not necessarily result in enhanced transferability. Badoe and Miller (1995) also report a similar finding where over-specification led to reduced transferability. Overall, it is noted that a good fit in the estimation context may not be sufficient.
Another issue of concern is the ability of a model to capture causal relationships. As clearly highlighted by Atherton and Ben-Akiva (1976): "To be transferable, then, it is not enough that the model merely fit existing data; it must also explain why travel behaviour changes as conditions change. Rather than simply correlating existing travel behaviour with socioeconomic characteristics and transportation level of service, the model specification must represent the causal relationships between these variables. Thus, the causal specification of a model is a precondition to its consideration for transferability." For example, Chorus and Kroesen (2014) argue against the transferability of hybrid choice models for predicting policy outcomes, as these models (theoretically) cannot capture the causal relationship between the latent variable and the travel choice.
The only way to empirically establish whether a model is under/over specified or if it captures the causal relations required for transferability is to undertake a posterior analysis of transferability. Nonetheless, as Koppelman and Wilmot (1982) note, such posterior analyses of transferability are undertaken with the intent to provide insights that can be helpful for (future) prior transferability studies. This study aims to get such insights for the case of transit route choice models, specifically the ones estimated based on smart card data.
So far, most validation studies in the literature have been for mode or mode-destination choice models. In the case of route choice models, some studies undertake an internal validation (see, for example, Lai and Bierlaire (2015) and Mai (2016)), but very few undertake an external validation. Bekhor and Prato (2009) were the first to consider the issue of transferability of route choice models. They undertook a spatial transferability assessment of traffic route choice models based on two independent revealed preference survey data sets, one each for the Boston and Turin networks. In addition to assessing the transferability of the route choice models, they also evaluated the transferability of path generation techniques. In their case, the transferability of route choice model parameters could not be verified, partly due to the dissimilarity in characteristics between the two networks.
To the best of our knowledge, none of the studies so far have undertaken an external validation of a transit route choice model. This study addresses this gap by undertaking a transferability analysis across two closely spaced time periods for the same urban area, which allows for many exogenous factors to be controlled for, including any major changes in the underlying population. Specifically, the following issues are investigated using the smart card data from before and after a major network change: (i) How transferable are models of transit route choice estimated using smart card data, and can they be used for forecasting the changes in demand because of network changes? (ii) How does omitting/adding relevant variables (determined based on improved goodness of fit measures in the base context) impact models' prediction performance?

Case study context and data preparation
In July 2018, a new metro line (the north-south line) was introduced in the urban transit network of Amsterdam, the Netherlands, adding significant capacity to the existing network of metro, bus and tram lines. The new metro line runs through the dense historical city centre, and connects the northern part of the city with the centre, a connection which was previously made via buses with highly circuitous routes. The opening of the new metro line was accompanied by a restructuring of the existing bus and tram network, including the addition of new feeder routes and re-routing or removal of duplicate routes. The new metro line differs from the existing ones in a few aspects: some of the stations (especially the ones in the city centre) are deeper than the existing metro stations, implying a longer access time to the metro. In addition, the frequency for the new line is higher than the frequencies offered on the other metro lines (see Brands et al. (2020) for details). This significant change in public transport supply provides an opportunity to undertake a transferability analysis for the transit route choice models developed for the network using the two time periods corresponding to before and after the opening of the new metro, as shown in Fig. 1. We use 5 weeks of data in the time period before and 6 weeks in the time period after the opening of the new metro line. Although the two time periods used in this study are very close together, the major changes to the transit network supply cause significant changes to the flow patterns (as shown in Brands et al. (2020)), making this case study ideal for undertaking a model transferability analysis.
We use a combination of smart card and Automated Vehicle Location (AVL) data for the route choice model estimation and validation analysis (see van Oort et al. (2015) for an overview of the Dutch smart card system). The smart card data used includes all the journeys made in the network, including those by tourists that could use an unlimited travel ticket for one or more days valid for all modes of public transport. These tickets also need to be validated for each public transport trip, and are hence recorded in the data. There are no mode-specific season passes in the network, and the fare is based on the (network) distance travelled irrespective of the mode used. It is also important to note that the same individual may be recorded multiple times, but owing to privacy concerns, we cannot track them across days and hence consider them as independent observations.
Fig. 1 The time period for validation analysis
The raw smart card data is processed by undertaking cleaning, destination inference and transfer inference to form a journey database (see Dixit et al. (2019) for more details on these steps). For the route choice analysis, we use only the morning peak period for our model estimation, as this period is expected to have a higher share of commuters, who are typically more regular travellers and therefore make more conscious travel choices (Fox and Hess 2010). The choice set is derived based on the observed routes used by all the travellers in the data set. Transit stops in close proximity are clustered together to form a more realistic consideration choice set, and a minimum threshold of 20 journeys for each route in the before period (and 24 in the after period) is applied to ensure only reasonable routes are included (see Dixit et al. (2021) for more details on this). After applying all filters, a dataset of 382,295 observations for the before period and 563,210 for the after period is obtained, which is used for estimation and validation of the model, respectively. This corresponds to 582 OD pairs in the before period and 593 in the after period.
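The route-frequency filter described above can be illustrated with a minimal sketch. The function name `build_choice_sets`, the tuple layout of a journey record, and the toy data are assumptions made for illustration only; stop clustering and the other cleaning steps are not shown.

```python
from collections import Counter


def build_choice_sets(journeys, min_journeys=20):
    """Derive observed choice sets per OD pair.

    journeys: iterable of (origin, destination, route_id) tuples
              (hypothetical record layout, one tuple per observed journey).
    min_journeys: routes observed fewer times than this are dropped.
    """
    counts = Counter(journeys)
    choice_sets = {}
    for (origin, destination, route_id), n in counts.items():
        if n >= min_journeys:
            choice_sets.setdefault((origin, destination), set()).add(route_id)
    return choice_sets


# Toy data: route r2 is observed only 5 times and is therefore excluded.
journeys = (
    [("A", "B", "r1")] * 25
    + [("A", "B", "r2")] * 5
    + [("A", "C", "r3")] * 20
)
choice_sets = build_choice_sets(journeys, min_journeys=20)
print(choice_sets)
```

For the after period the same function would be called with `min_journeys=24`, the threshold reported in the text.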

Model specification
We specify and test three models of transit route choice. We start with an MNL model with mode-specific travel attributes, with the deterministic component of utilities as specified in Eq. 1.
where $IVT_{bus}$ and $IVT_{tram}$ are the in-vehicle times by bus and tram in minutes, respectively, $WT_{bt}$ is the expected initial waiting time for bus and tram modes, $TT_{metro}$ is the travel time by metro including the initial waiting time at the platform, $Trans_{bt}$, $Trans_{btm}$ and $Trans_{m}$ are the numbers of transfers made within the bus/tram network (which includes bus-bus, tram-tram and bus-tram transfers); transfers between metro and bus/tram; and transfers within the metro network, respectively, $TrT$ is the transfer time in minutes, $Circ$ is the circuity of the route measured as the ratio of network to Euclidean distance, and $MSC_{bus}$, $MSC_{tram}$ and $MSC_{metro}$ are the mode-specific constants for bus, tram and metro, respectively.
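The specification described above can be sketched as a linear-in-parameters utility for route $i$. The coefficient symbols $\beta$ and the indicator $\delta_{im}$ (equal to 1 when route $i$ uses mode $m$, 0 otherwise) are notational assumptions, as is the way in which the mode-specific constants enter the utility:

```latex
\begin{aligned}
V_i ={}& \beta_{IVT}^{bus}\, IVT_{bus} + \beta_{IVT}^{tram}\, IVT_{tram}
        + \beta_{WT}\, WT_{bt} + \beta_{TT}^{metro}\, TT_{metro} \\
      &+ \beta_{Tr}^{bt}\, Trans_{bt} + \beta_{Tr}^{btm}\, Trans_{btm}
        + \beta_{Tr}^{m}\, Trans_{m} + \beta_{TrT}\, TrT + \beta_{Circ}\, Circ \\
      &+ \sum_{m \in \{bus,\, tram,\, metro\}} \delta_{im}\, MSC_{m}
\end{aligned}
```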
It is important to note that the smart card data in Amsterdam contains different information for bus and tram versus metro. For buses and trams, the smart card is tapped inside the vehicle, whereas for metro, this happens at the station. Hence, the time measured by smart card data for metro includes the waiting time at the origin. For buses and trams, the time of arrival of the passenger at the stop is not known, so the effective waiting time is derived based on the observed headway at each origin station using the corresponding vehicle arrival information available from AVL data. We therefore use $IVT_{bus/tram}$ to denote the in-vehicle time (without waiting time) for buses and trams and $TT_{metro}$ for the travel time by metro, which includes waiting time.
The Amsterdam public transport network follows a (network) distance-based fare system, with the same fare per distance travelled irrespective of the mode(s) used. This makes the travel cost directly proportional to circuity for the alternative routes between an origin-destination pair. Hence, travel cost has not been considered as an additional, separate variable in our models.
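As an illustration of how such a specification yields choice probabilities, the sketch below evaluates a linear utility and the standard MNL (softmax) formula for a small choice set. All coefficient values are hypothetical and chosen only for illustration (they are not estimates from this study), and adding one constant per mode used on a route is an assumption.

```python
import math

# Hypothetical coefficients, for illustration only (not estimates from the study).
BETA = {
    "ivt_bus": -0.10, "ivt_tram": -0.09, "wt_bt": -0.12, "tt_metro": -0.08,
    "trans_bt": -1.0, "trans_btm": -0.8, "trans_m": -0.5,
    "trt": -0.15, "circ": -0.5,
}
# Hypothetical mode-specific constants; bus is used as the reference here.
MSC = {"bus": 0.0, "tram": 0.2, "metro": 0.5}


def utility(route):
    """Linear-in-parameters deterministic utility of one route."""
    v = sum(BETA[k] * route.get(k, 0.0) for k in BETA)
    # Assumption: one constant is added for each mode used on the route.
    v += sum(MSC[m] for m in route["modes"])
    return v


def mnl_probabilities(routes):
    """MNL choice probabilities for the routes of one OD pair."""
    utilities = [utility(r) for r in routes]
    v_max = max(utilities)  # subtract the maximum for numerical stability
    exp_v = [math.exp(v - v_max) for v in utilities]
    total = sum(exp_v)
    return [e / total for e in exp_v]


routes = [
    {"ivt_bus": 20, "wt_bt": 5, "circ": 1.4, "modes": ["bus"]},
    {"tt_metro": 15, "circ": 1.2, "modes": ["metro"]},
]
probs = mnl_probabilities(routes)
print(probs)
```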
Next, to analyse the impact of omitting variables, instead of mode-specific in-vehicle time and transfer penalties, we use generic ones. The deterministic utility function in this case is shown in Eq. 2.
where $IVT$ corresponds to the in-vehicle time in minutes, and $Trans$ is the number of transfers made within or across modes.
Lastly, we test the model which incorporates the overlap between alternative routes. For this, we use a Path Size Correction Logit model which includes overlap of path and transfer nodes. The path size correction terms for journey legs ($PSC_i^T$) and transfer nodes ($PSC_i^X$) are as defined in Dixit et al. (2021), given by
$$PSC_i^T = -\sum_{l \in \Gamma_i} \frac{t_l}{T_i} \ln \sum_{j \in C} \delta_{lj}$$
$$PSC_i^X = -\sum_{n \in K_i} \frac{1}{X_i} \ln \sum_{j \in C} \delta_{nj}$$
where $t_l$ = travel time for journey leg $l$ in route $i$, $T_i$ = total travel time for route $i$, $\Gamma_i$ = set of all legs for route $i$, $C$ = set of all routes between the chosen origin-destination pair, $\delta_{lj}$ = leg-route incidence between leg $l$ belonging to alternative route $j$, $X_i$ = number of transfer nodes in route $i$, $K_i$ = set of all nodes for route $i$, and $\delta_{nj}$ = node-route incidence between node $n$ belonging to alternative route $j$.
The PSC terms decrease as the amount of overlap between alternative routes increases, with a maximum value of 0 for perfectly independent routes and a lower theoretical bound of $-\infty$ for perfectly correlated alternatives. These path size correction terms are added to the deterministic utility function with mode-specific travel attributes as defined in Eq. 1.
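A leg-based correction term with the properties described above (0 for independent routes, increasingly negative with more overlap) can be sketched as follows. The logarithmic, travel-time-weighted form used here is an assumption about the exact functional form, and the function name and input layout are hypothetical.

```python
import math


def path_size_correction(routes):
    """Leg-based path size correction per route.

    routes: dict mapping route_id -> list of (leg_id, travel_time) tuples
            (hypothetical layout). A leg shared by several routes carries
            the same leg_id in each of them.
    """
    # Number of routes in the choice set that use each leg.
    usage = {}
    for legs in routes.values():
        for leg_id in {leg for leg, _ in legs}:
            usage[leg_id] = usage.get(leg_id, 0) + 1

    psc = {}
    for route_id, legs in routes.items():
        total_time = sum(t for _, t in legs)
        # Each leg is weighted by its share of the route's travel time.
        psc[route_id] = -sum(
            (t / total_time) * math.log(usage[leg]) for leg, t in legs
        )
    return psc


# Routes A and B share leg "x", which makes up half of each route's travel time.
overlapping = {
    "A": [("x", 10.0), ("a", 10.0)],
    "B": [("x", 10.0), ("b", 10.0)],
}
independent = {"A": [("a", 10.0)], "B": [("b", 12.0)]}
print(path_size_correction(overlapping))
print(path_size_correction(independent))
```

The transfer-node term would follow the same pattern, with nodes in place of legs and equal weights in place of travel-time shares.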

Validation assessment
We divide the external validation metrics into two categories. The first category relates to the stability of estimated parameters, while the second one assesses the predictive ability of the model. The second category is further split into measures of disaggregate and aggregate predictions. Figure 2 shows this classification, and the sets of metrics used for each category. Subsequent sections elaborate on each of the metrics used.
While this study focuses on external validation of route choice models, it should be noted that the models used in this study have been validated internally using a cross-validation approach. Readers are referred to Dixit et al. (2021) for more details and results of the internal validation exercise.

Model parameter equality
The first test of transferability consists of comparing the parameters estimated for the base and transfer contexts (the before and after situations in our case). This helps establish whether some parameters are more transferable than others. Since the two datasets are from different time points, we first check for differences in scale parameters between the two. For this, the two datasets are pooled together and the scale parameter is estimated relative to the 'before' dataset.
After adjusting for scale differences, the parameters estimated for the two cases are compared. For each estimated parameter, the relative error measure (REM) is calculated as:
$$REM_k = \frac{\mu \beta_k^{after} - \beta_k^{before}}{\beta_k^{before}}$$
where $\mu$ is the scale parameter to account for differences in error variance between the two cases, $\beta_k^{after}$ is the parameter for attribute 'k' estimated for the after case, and $\beta_k^{before}$ is the parameter for attribute 'k' estimated for the before case.
Next, we check for statistical significance of the differences in each of the model parameters by means of a t-test, as described in Fox (2015). The t-statistic, in this case, is given by:
$$t_k = \frac{\beta_k^{before} - \mu \beta_k^{after}}{\sqrt{\mathrm{var}\left(\beta_k^{before} - \mu \beta_k^{after}\right)}}$$
The denominator corresponds to the standard error of the difference in parameters. In our study, although the two datasets were collected a few months apart, we do not link individual observations collected in different periods. When the covariance is assumed to be zero, and the scale fixed, the standard error of the difference is given by:
$$SE_k = \sqrt{\left(\sigma_k^{before}\right)^2 + \mu^2 \left(\sigma_k^{after}\right)^2}$$
where $\sigma_k^{before}$ and $\sigma_k^{after}$ are the standard errors of the parameter for attribute 'k' estimated for the before and after cases, respectively.
Since model parameters cannot be interpreted directly, we also compare the model elasticities between the two time periods. Elasticities capture the sensitivity of a model to changes in key input variables such as travel times. The disaggregate point elasticity of alternative route 'r' for an individual 'i' with respect to a variable $K$ is calculated as:
$$E_{irK} = \beta_K \, x_{irK} \left(1 - P_{ir}\right)$$
where $x_{irK}$ is the value of variable $K$ for route 'r' and individual 'i', and $P_{ir}$ is the predicted probability that individual 'i' chooses route 'r'. The disaggregate elasticity is aggregated by calculating a (probability) weighted average across all individuals and alternatives, to arrive at an average elasticity value which is compared between the locally estimated and transferred models. Being dimensionless, model elasticities can be compared across different models and time periods (Fox 2015).
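The parameter-stability checks described in this section can be sketched numerically. The sketch assumes that the relative error is expressed relative to the 'before' estimate, that the scale is applied to the 'after' estimate and its standard error, and that the elasticity is the standard direct point elasticity of an MNL model; the function names and the numbers in the example are hypothetical.

```python
import math


def relative_error(beta_before, beta_after, mu=1.0):
    """Relative difference of the scaled 'after' estimate w.r.t. 'before'."""
    return (mu * beta_after - beta_before) / beta_before


def t_difference(beta_before, se_before, beta_after, se_after, mu=1.0):
    """t-statistic for the difference, assuming zero covariance and fixed scale."""
    se_diff = math.sqrt(se_before ** 2 + (mu * se_after) ** 2)
    return (beta_before - mu * beta_after) / se_diff


def mnl_point_elasticity(beta, x, p):
    """Direct point elasticity of the choice probability p w.r.t. attribute x."""
    return beta * x * (1.0 - p)


# Hypothetical estimates for a single travel-time coefficient.
rem = relative_error(-0.10, -0.12)
t_stat = t_difference(-0.10, 0.003, -0.12, 0.004)
elasticity = mnl_point_elasticity(-0.10, 20.0, 0.25)
print(rem, t_stat, elasticity)
```

With very large samples the standard errors become small, so even a modest relative error can produce a t-statistic well above the usual critical value of 1.96.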

Disaggregate measures of predictive ability
Next, we assess how well the transferred model can predict the outcome of the network change. For this, the model parameters estimated based on the 'before' data are used to estimate the probabilities for the 'after' situation. The outcomes obtained using the transferred model are then compared to those from the locally estimated model (i.e. the model estimated with the same specification but using the 'after' data). In this section, we discuss the methods for comparing the performance for individual-level predictions. As there is no agreement in the literature on the best metric for this, we use multiple metrics, each providing a different perspective on it, as described below:
• Transferability Test Statistic (TTS): The TTS is similar to a likelihood ratio test undertaken between the transferred and locally estimated models, both applied to the 'after' data. It is a strict pass/fail test and is chi-squared distributed with degrees of freedom equal to the number of model parameters. This has been used by Atherton and Ben-Akiva (1976) and Koppelman and Wilmot (1982), among others, to test model transferability.
$$TTS_{after}(\beta_{before}) = -2 \left( LL_{after}(\beta_{before}) - LL_{after}(\beta_{after}) \right)$$
where $LL_{after}(\beta_{before})$ is the log of the likelihood that the observed 'after' data were generated by the transferred model, and $LL_{after}(\beta_{after})$ is the log-likelihood of the locally estimated model on the 'after' data.
Although commonly reported, it has been observed that almost all models fail this strict test of transferability (Badoe and Miller 1995; Fox 2015).
• First preference recovery (FPR): Also referred to as 'percentage of correct predictions', this shows the percentage of choices correctly estimated by the model, given by Eq. 8:
$$FPR = \frac{100}{N} \sum_{i=1}^{N} \mathbb{1}\left[ y_i^p = y_i^o \right]$$
where $y_i^p$ is the predicted choice (route) for an individual 'i', $y_i^o$ is the observed choice (route) for an individual 'i', and $N$ is the number of individuals (observations) in the data.
As opposed to TTS, the FPR provides an indication of the degree of transferability and can be used to compare alternative models in terms of how well they can predict individual choices. It can also be useful when comparing the results with similar studies in the literature. However, a major limitation of this measure is its inability to differentiate between the range of probabilities assigned to the chosen alternatives (de Luca and Cantarella 2016). Hence, we look at another measure of disaggregate predictive performance, the Brier score, which considers the probabilities assigned to the chosen and non-chosen alternatives.
• Brier score (BS): The Brier score (Brier 1950) is an absolute measure used to quantify the accuracy of probabilistic predictions. For each alternative route in each observation, the actual outcome is subtracted from the predicted probability of choosing it. The square of this value is summed across all alternatives for each observation, and averaged across all observations. Mathematically, it is given by:
$$BS = \frac{1}{N} \sum_{i=1}^{N} \sum_{r=1}^{R_i} \left( P_{ir} - y_{ir} \right)^2$$
where $P_{ir}$ is the predicted probability that an individual 'i' chooses alternative route 'r', $y_{ir}$ is equal to 1 if alternative route 'r' is chosen by individual 'i' and 0 otherwise, $R_i$ is the number of alternative routes available to individual 'i', and $N$ is the number of individuals (observations) in the data.
The Brier score has a minimum value of 0 for perfect predictions and a maximum value of 2 for the worst possible prediction.
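The three disaggregate measures can be computed directly from predicted probabilities and observed choices, as in the minimal sketch below. It assumes that the predicted choice used for the FPR is the alternative with the highest predicted probability; the function names and the toy data are hypothetical.

```python
import math


def log_likelihood(probs, chosen):
    """Sum of log-probabilities assigned to the observed choices."""
    return sum(math.log(p[c]) for p, c in zip(probs, chosen))


def tts(ll_transferred, ll_local):
    """Transferability test statistic from two log-likelihoods on the same data."""
    return -2.0 * (ll_transferred - ll_local)


def first_preference_recovery(probs, chosen):
    """Percentage of observations whose most probable alternative was chosen."""
    hits = sum(
        1
        for p, c in zip(probs, chosen)
        if max(range(len(p)), key=p.__getitem__) == c
    )
    return 100.0 * hits / len(chosen)


def brier_score(probs, chosen):
    """Mean over observations of squared errors summed over alternatives."""
    total = 0.0
    for p, c in zip(probs, chosen):
        total += sum(
            (p_r - (1.0 if r == c else 0.0)) ** 2 for r, p_r in enumerate(p)
        )
    return total / len(chosen)


# Toy example: three observations, index of the chosen route in `chosen`.
probs = [[0.7, 0.3], [0.4, 0.6], [0.5, 0.25, 0.25]]
chosen = [0, 0, 1]
print(first_preference_recovery(probs, chosen), brier_score(probs, chosen))
```

In the toy data only the first observation is recovered by the FPR, while the Brier score also penalises the second and third observations according to how much probability was placed on the non-chosen routes.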

Aggregate measures of predictive ability
From a policy perspective, one is often more interested in aggregate level shares as opposed to individual level predictions. Depending on the application under consideration and the requirements of the decision maker, different levels of forecasting may be relevant. For example, for understanding capacity and infrastructure issues as well as for operational planning, link-level forecasts are highly relevant to identify bottlenecks and related capacity needs. On the other hand, for service planning aspects such as fare scheme policies, network design, and frequency setting, mode or route level forecasts may be most relevant for assessing overall trends. Hence, in this study, we compare the shares estimated by the models at each of these levels of aggregation. To do this, individual probabilities are summed to calculate the market shares for each alternative route for each OD pair. The aggregate predictions are assessed using three metrics addressing the different levels of aggregation, as discussed below:

• Predictions per route-Mean Absolute Error (MAE):
The predicted shares (passenger flows) are compared to the observed ones, and the Mean Absolute Error (MAE) for each origin-destination (OD) pair is calculated as:
$$MAE_{od} = \frac{1}{R} \sum_{r=1}^{R} \left| S_r^p - S_r^o \right|$$
where $S_r^p$ is the predicted flow for alternative route 'r' for origin-destination pair 'o-d', $S_r^o$ is the observed flow for alternative route 'r' for origin-destination pair 'o-d', and $R$ is the number of available routes for origin-destination pair 'o-d'.
The $MAE_{od}$ is then averaged across all ODs to get an average MAE per route.

• Predictions per link-Mean Absolute Error (MAE):
The passenger flows per route are aggregated to calculate flows for each link. A link here refers to the path connecting two consecutive transit stops, which may be used by multiple transit routes. Similar to the route level, MAE is calculated for each link, and a mean MAE over all links is reported. The percentage error in predicting the flow on each link is also visualized to identify patterns.

• Predicted modal shares-Mean Absolute Percentage Error (MAPE):
The predicted passenger flows on each route are further aggregated to calculate the market share for each mode combination. This is specifically relevant in our case as we would like to know how well the model performs when estimating the impact of network change on public transport mode-shares. The observed and predicted mode shares are compared, and the Mean Absolute Percentage Error (MAPE) is calculated as:
$$MAPE = \frac{100}{M} \sum_{m=1}^{M} \frac{\left| P_m^p - P_m^o \right|}{P_m^o}$$
where $P_m^p$ is the predicted mode-share for mode (combination) 'm', $P_m^o$ is the observed mode-share for mode (combination) 'm', and $M$ is the number of mode (combinations).
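The three aggregate measures can be sketched as follows: route-level MAE for one OD pair, aggregation of route flows to links, and MAPE over mode shares. The function names, dictionary layouts and numbers are hypothetical and serve only to illustrate the calculations.

```python
def mae_per_od(predicted, observed):
    """Mean absolute error over the routes of one OD pair (flows keyed by route)."""
    return sum(abs(predicted[r] - observed[r]) for r in observed) / len(observed)


def link_flows(route_flows, route_links):
    """Aggregate route flows to links; a link may be used by several routes."""
    flows = {}
    for route, flow in route_flows.items():
        for link in route_links[route]:
            flows[link] = flows.get(link, 0.0) + flow
    return flows


def mape(predicted_share, observed_share):
    """Mean absolute percentage error over mode (combination) shares."""
    errors = [
        abs(predicted_share[m] - observed_share[m]) / observed_share[m]
        for m in observed_share
    ]
    return 100.0 * sum(errors) / len(errors)


# Toy example for one OD pair with two routes sharing link "b".
predicted = {"r1": 60.0, "r2": 40.0}
observed = {"r1": 50.0, "r2": 50.0}
route_links = {"r1": ["a", "b"], "r2": ["b", "c"]}

print(mae_per_od(predicted, observed))
print(link_flows(predicted, route_links))
print(mape({"bus": 0.22, "metro": 0.78}, {"bus": 0.20, "metro": 0.80}))
```

In the toy data the two route errors cancel out on the shared link "b", which shows why errors at the route and link levels can differ for the same set of predictions.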

Results and discussion
We first evaluate the validity of the MNL model with mode-specific travel attributes as described in Eq. 1. Then, the impact of variable omission is examined by testing the validity of the two alternative model specifications.

Model parameter equality
We start with examining the stability of the estimated model parameters across the before and after time periods. Before comparing the parameters, the two models were checked for differences in scale parameters. The scale difference was found to be significant with a value of 0.92 for the after model relative to the before model, implying a lower variance in the unobserved parameters for the after case.
Table 1 shows the parameters estimated from the two models after scaling, and the corresponding REM and t-test statistic for each. The relative error measure is the highest for the mode-specific constants, implying a significant difference in the average effect of unobserved (excluded) variables specific to each mode between the two contexts. This could include attributes like comfort, safety, cleanliness, reliability, weather protection at stations, availability of information, ease-of-navigation or any other inherent (dis)preference for a particular (public transport) mode. Some of these attributes are expected to change after the introduction of the new line. For example, deeper stations of the new metro line may reduce the attractiveness of the mode. On the other hand, the higher frequency and more options for travel may lead to it being more attractive. Amongst the rest of the parameters, circuity is found to have the highest change (an increase of 45% in magnitude), followed by the number of transfers within the metro, which is found to decrease in magnitude by 19%. Although the REM values for all other parameters are approximately 10% or less, the null hypothesis of the parameters being identical across the two cases is rejected for most of them (at the 95% confidence level). Only the travel time by metro, number of transfers between bus and tram, and the transfer time are found to be stable across the two time periods as per the t-statistic. It should be noted though that for both transferred and locally estimated models, the parameters are precisely estimated with relatively low standard errors (t-ratios above 20), which is often the case for models estimated with large scale data sources such as smart card data. Because of this, the null hypothesis of the parameters being identical across the two contexts is more likely to be rejected.
Next, we compare the model elasticities. Table 2 shows the elasticities for the transferred and locally estimated models. For both contexts, the (absolute) elasticity values are the highest for bus in-vehicle time, followed closely by the waiting time for bus and tram. The (absolute) elasticity values are found to be higher for the locally estimated model than for the transferred model, implying that the transferred model estimates the demand to be more inelastic than the locally estimated model.
There could be several reasons for the differences in estimated parameters and elasticities between the two contexts. Firstly, there could be contextual factors that are not captured by the observed variables and that differ between the two contexts. Secondly, the underlying population may have changed: some travellers may have stopped using transit after the network change while other new travellers may have been added. Also, some existing travellers may have reduced or increased their travel frequencies. A related point is the possible presence of endogeneity, especially since our study is based on observational data. The models assume the explanatory variables to be exogenous, which may not be true. This could be due to multiple reasons, including omitted variables. For example, the new metro line has newer, cleaner and more aesthetically pleasing trains and stations contributing toward comfort, which is not included in our model(s). Concurrently, the new metro line provides direct routes with lower circuity values compared to the rest, making the circuity correlated with the unobserved attribute of comfort. When using a model to predict the demand in response to changes in policy, it is important to have a model capturing the causal relationship between them. The smart card data used for this study does not provide the origin (home) location of the travellers. The missing attributes (such as access/egress distance or time, comfort levels, reliability, and accessibility of modes, among others) and/or endogeneity may hamper the establishment of a causal relationship. Lastly, one cannot theoretically rule out that our model may have been misspecified (wrong model structure or non-linear relationships between variables), or that the behavioural theory is altogether inconsistent with the observed behaviour. Irrespective of the reasons behind the instability of model parameter values, our results imply that one should be cautious when making inferences on the relative valuation of travel time or service quality attributes from such models, specifically if they have not been thoroughly validated. We also note that since we do not track the individuals across days, we are not able to capture the impact of the panel structure of the data in our model. This may have resulted in an underestimation of the standard errors of parameters for both transferred and locally estimated models. This further motivates us to analyse the predictive performance of our model to establish its usefulness for applications.

Disaggregate measures
Next, we test the predictive ability of the model by forecasting the impact of the network change at an individual, route, link and mode level. The predictions are compared with the predictions from a locally estimated model (i.e. a model estimated with the same specification but using the 'after' data) to benchmark the performance. We start with the measures of predictive performance at a disaggregate level, which are shown in Table 3. The TTS, compared against the chi-squared distribution for 11 degrees of freedom, strongly rejects the hypothesis that the two sets of parameters are equal. However, as many other studies note, most models fail this test of model transferability, but may still be good in their predictive abilities (see for example Badoe and Miller (1995); Forsey et al. (2014); Fox (2015)). In most cases, blindly following the results from a strict pass/fail test such as the TTS may not be a wise decision. While the TTS is useful to inform the modeller that the two models are different, as Parady et al. (2021) highlight, it is important to assess the extent to which they differ. To understand the extent of these differences in our case, we evaluate other disaggregate and aggregate level measures as discussed below.
The FPR of over 70% is found to be rather high compared to values reported for other route choice models in the literature, where this percentage ranges between 51 and 73% for some of the recent studies (Parady et al. 2021). Moreover, both the FPR and the Brier score of the transferred models are found to be very close to those of the locally estimated models (< 1% difference), with the FPR being marginally lower and the Brier score slightly higher in the case of transferred models. Hence, although many of the parameter estimates differ for the two periods, the predicted choice probabilities of the transferred model at an individual level are found to be close to those of the locally estimated model.
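For readers less familiar with the two disaggregate measures, the sketch below shows one way they can be computed from predicted choice probabilities. It assumes FPR is the share of observations for which the alternative with the highest predicted probability is the chosen one, and uses the multi-alternative form of the Brier score (squared differences summed over alternatives, averaged over observations); the normalisation used for Table 3 may differ. The probabilities and choices are hypothetical.

```python
def first_preference_recovery(probabilities, chosen):
    """Share of observations where the highest-probability alternative is the chosen one."""
    hits = 0
    for probs, choice in zip(probabilities, chosen):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        hits += predicted == choice
    return hits / len(chosen)

def brier_score(probabilities, chosen):
    """Mean over observations of the squared distance between predicted
    probabilities and the one-hot vector of the observed choice."""
    total = 0.0
    for probs, choice in zip(probabilities, chosen):
        total += sum((p - (1.0 if j == choice else 0.0)) ** 2
                     for j, p in enumerate(probs))
    return total / len(chosen)

# Hypothetical predicted route probabilities for three journeys, and the
# index of the route actually chosen in each case.
probs = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.6, 0.3, 0.1]]
chosen = [0, 1, 1]

fpr = first_preference_recovery(probs, chosen)
bs = brier_score(probs, chosen)
print(f"FPR = {fpr:.3f}, Brier score = {bs:.3f}")
```

A higher FPR and a lower Brier score both indicate better individual-level predictions; because each observation carries its own probability list, choice sets of different sizes are handled without changes.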

Aggregate measures
Disaggregate measures like FPR and Brier score are often used for assessing the models in terms of their ability to predict individual-level choices in the new context. However, in most applications, one is more interested in the predictions at the mode, route, or link levels. Hence, we analyse the performance of the transferred model to predict the market shares at each of these levels.
First, we use the MAE to compare the local and transferred models in terms of their predictions at the route level (Table 4). The transferred model shows an average error of 45 journeys per route, compared to 43 for the locally estimated model. Examining the predicted flows at a link level, we observe an MAE of 328 passengers per link over the entire morning peak period when predicted using the transferred model, ~8.6% higher than that obtained by the locally estimated model.
Figures 3 and 4 show the error in flow prediction at the link level by the local and transferred models, respectively. The width of the lines corresponds to the observed flow on the link. A positive error implies that the model overestimated the flow on the link, while a negative error means an underestimate of the observed flow. The maps show that the link-level flow predictions using the two models are similar overall, implying that using the 'before' data set for estimation of the model is not a problem per se, compared to the inherent estimation errors when using such a model. The maps can give an indication of the possible causes of such errors. For example, close to the central station, there are two parallel tram routes with one showing an underestimation and the other an overestimation of flows. These parallel tram lines are highlighted with a red circle in the maps and enlarged in the top right corner of both figures. The errors in estimation could be attributed to the assumptions made regarding the consideration choice set for the model. The smart card data does not provide information on the origin (home) location of the travellers. Hence, stops within a maximum distance of 500 m were clustered together to form the consideration choice set for travellers (for more details on the clustering process see Dixit et al. (2021)). In the absence of the actual origin location, all routes between the origin-destination stop-clusters are assumed to be equally accessible for the travellers. However, for origin-destination pairs such as the one highlighted, where the distance travelled is very short, travellers are likely to choose the transit stop closest to them as opposed to the one with the shortest generalized cost that is predicted by the model. Hence, in such cases the link-level predictions can be erroneous, and should be used with caution.
Next, we compare the market shares of each transit mode combination in the data (Table 5). The predicted and observed shares are found to be close to each other, with a difference of less than 1 percent for most mode combinations for both local and transferred models. As expected, the MAPE is found to be slightly higher for the transferred model than for the local model. Overall, the transferred model is found to perform close to the local model in terms of mode-share predictions as well.
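The aggregate measures used in this section can be sketched as follows. The sketch assumes the MAE is the mean absolute difference between predicted and observed flows per route (or link), and the MAPE is the mean absolute difference in shares relative to the observed share, expressed in percent; the exact definitions behind Tables 4 and 5 may differ. All flows and shares below are hypothetical.

```python
def mean_absolute_error(predicted, observed):
    """Average absolute difference between predicted and observed values."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)

def mean_absolute_percentage_error(predicted, observed):
    """Average absolute difference relative to the observed value, in percent."""
    return 100.0 * sum(abs(p - o) / o
                       for p, o in zip(predicted, observed)) / len(observed)

# Hypothetical journeys per route (NOT values from the study).
observed_route_flows = [120, 40, 300, 15]
predicted_route_flows = [100, 55, 320, 10]
mae = mean_absolute_error(predicted_route_flows, observed_route_flows)

# Hypothetical mode shares in percent (NOT values from Table 5).
observed_shares = [50.0, 30.0, 20.0]
predicted_shares = [49.0, 30.6, 20.4]
mape = mean_absolute_percentage_error(predicted_shares, observed_shares)

print(f"MAE = {mae:.1f} journeys per route, MAPE = {mape:.1f}%")
```

Because the MAE is expressed in absolute units, it is most informative when read next to the typical flow per route or link, or next to the same measure for the locally estimated model, as done above.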

Impact of omitted variables
In this section, we analyse the impact of omitting/adding one or more variables on the model's predictive performance (Table 6). We test two scenarios:
1. Generic MNL: Generic travel time and transfer parameters as opposed to mode-specific ones as specified in Eq. 2.
2. Including overlap: Path size correction logit (PSCL) model including path size correction terms as defined in Eq. 3 to incorporate the impact of overlap between alternate routes.
Both in the estimation (before) and the prediction (after) contexts, the model with generic travel time and transfer parameters has the worst fit for the data, as shown by the respective log-likelihood values (even when adjusted for the number of parameters). The predictive performance is also found to suffer significantly when generic attributes are used. Conversely, when overlap is incorporated, the model fit is improved significantly in the estimation context (likelihood ratio statistic of 838.6 exceeding the critical χ² value of 9.2 at the 1% significance level (df = 2)), but the log-likelihood for the prediction context is found to be lower than that of the mode-specific MNL model. In terms of predictions, there is a marginal improvement in the FPR and MAE for route-level prediction. However, the Brier score and the predictions at mode level are slightly worse.
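The likelihood ratio comparison reported above can be checked with a short sketch. It relies on the fact that the chi-squared distribution with 2 degrees of freedom has the closed-form CDF 1 − exp(−x/2), so the critical value at significance level α is −2 ln α. The statistic of 838.6 is the value reported in the text; the two log-likelihood values are hypothetical and chosen only to reproduce that statistic.

```python
import math

def likelihood_ratio_statistic(ll_restricted, ll_unrestricted):
    """LR = -2 * (LL of restricted model - LL of unrestricted model)."""
    return -2.0 * (ll_restricted - ll_unrestricted)

def chi2_critical_df2(alpha):
    """Critical value of the chi-squared distribution with 2 degrees of
    freedom, using its closed-form CDF 1 - exp(-x / 2)."""
    return -2.0 * math.log(alpha)

critical = chi2_critical_df2(0.01)      # approximately 9.21
lr_reported = 838.6                     # statistic reported in the text
reject_mnl = lr_reported > critical     # PSCL fits significantly better

# Hypothetical log-likelihoods chosen only to reproduce the reported statistic.
ll_mnl = -100000.0    # restricted model (mode-specific MNL)
ll_pscl = -99580.7    # unrestricted model (PSCL with two extra terms)
lr = likelihood_ratio_statistic(ll_mnl, ll_pscl)

print(f"critical value = {critical:.2f}, LR = {lr:.1f}, reject: {reject_mnl}")
```

For other degrees of freedom no such closed form exists, and a statistical library would normally be used to obtain the critical value.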
When inferring route choice using revealed preference data sources in general, and smart card data in particular, the analyst does not have any 'direct' information on which attributes were considered by the decision-maker when making the choice. Hence, the selection of attributes to be included in the model depends heavily on the judgement of the analyst, and often on data availability. It is known that the omission of a relevant variable can impact the model transferability (Koppelman and Wilmot 1986), especially when the missing variable is a confounding one. Conversely, if a simple model with fewer variables can perform just as well, then excluding variables can make the data collection as well as the estimation easier. In our case, generic travel time and transfer parameters negatively impact the predictive ability of the transferred models. In contrast, including overlap does not offer a clear improvement in the predictive ability. In the end, the optimal selection of attributes depends on the purpose for which the model is intended to be used. For predicting passenger flows in the regions where the tram was replaced by the new metro line, all attributes that distinguish a metro from a tram should ideally be included in the model. The mode-specific MNL shows that the travel time and transfer parameters are different for different modes. Hence, using generic travel time and transfer parameters impacts the predictive performance of the model significantly. On the other hand, correcting for route overlap typically leads to an improvement in model fit in the case of route choice models (as in our case for the estimation context). However, our results seem to suggest that it may not necessarily increase the transferability of the models, and the overlap term(s) may be context-specific and hence not as transferable.

Conclusion
This study adds to the scarce literature on the validation of travel demand models and is the first to undertake an external validation for a transit route choice model. The model was developed based on smart card data for the urban transit network of Amsterdam and was used to predict the impact of a significant network change (i.e. the introduction of a new metro line) on the route choice behaviour of travellers. Validation was conducted by comparing the parameter values and a series of statistical performance indicators for the predictions with the observed behaviour after the network change.
Our results are overall in agreement with existing literature: the conclusion regarding model transferability depends on the (statistical) test used (Koppelman and Wilmot 1982). In our case, model parameter equality failed for most attributes, implying care should be taken in directly inferring behavioural insights from the parameter values from models such as those used in this study, specifically if they have not been thoroughly validated. However, the predictive performance of the transferred model was found to be close to the locally estimated model. When compared with the observed choices at an individual level, the model performed satisfactorily with a First Preference Recovery of 71.5%. Moreover, the predicted mode-shares were close to the observed ones, with a MAPE of 0.3%. When used for route and link level predictions, the errors were relatively larger, but the performance of the transferred model was similar to the local model (less than 10% error increase). We also investigated the impact of omitting relevant variables on predictive performance. When the mode-specific travel time and transfer parameters were replaced by generic ones, the performance suffered significantly. Conversely, including overlap in the model specification did not offer a clear improvement in model predictions, even though it had a better fit for the base data. This suggests that the overlap definition may be context specific and could perhaps be excluded when using a route choice model for predictions in favour of a parsimonious model.
When using smart card data for travel demand modelling, several assumptions are made regarding travellers' consideration choice set and perceived travel attributes. Visualizing link-level prediction errors can help indicate potential causes of errors. In our case, the assumption regarding consideration choice set may be responsible for some of the prediction errors, which are consistent between local and transferred models.
"All models are wrong, but some are useful" (Box 1976). To establish how wrong a model needs to be to stop being useful, we need more studies undertaking validation analysis for different networks and policy scenarios. Guidelines and standards on what is considered acceptable in terms of the various transferability statistics remain yet to be defined. In the past, the cost of undertaking model validation was high primarily due to data collection costs. The abundance of passively collected data such as the smart card provides an opportunity to validate transit route choice and assignment models at relatively low additional costs. Hence, validation must become an integral part of the development process for such models, and should be considered non-negotiable when using them for deriving any policy recommendations.
and professionals. Oded holds a dual-PhD from KTH Royal Institute of Technology, Sweden, and Technion-Israel Institute of Technology.

Niels van Oort works as an associate professor Public Transport at Delft University of Technology and is codirector of the Smart Public Transport Lab. He has been involved in public transport projects and research for over 15 years. His main fields of expertise are public transport planning and (data-driven) design, looking at the passenger perspective and societal impacts, related to for instance inclusiveness, land-use and sustainability. His application inspired work also involves shared mobility and first/last mile solutions. In addition to teaching, he is a frequently invited speaker and he published numerous articles in (international) journals and general media. He studied traffic and transport at Delft University of Technology and finished his master on Public Transport in 2003. He started to work as a researcher at HTM, the public transport company of The Hague. During this job, he started a part-time PhD research in Delft on service reliability, including joint work with Professor Wilson at MIT. In 2010, he changed jobs to work as a public transport consultant at Goudappel Coffeng mobility consultants. He finished his PhD in 2011 and started to work at Delft University of Technology in 2012 as a part-time assistant professor. Since 2018, after founding the Smart Public Transport Lab, he is full time employed. Since 2020, he has been involved in small projects via his company Van Oort Mobility Consultancy.

Ties Brands currently works as a Consultant Public Transport at Nederlandse Spoorwegen (NS). He holds an MSc degree at the University of Twente in Civil Engineering and Applied Mathematics. His MSc thesis was on optimisation of dynamic road pricing and was conducted at the Dutch consultancy company Goudappel Coffeng in Deventer. After graduation, he started working at Goudappel Coffeng as a public transport consultant, specialised in public transport modelling and data analysis. In 2010, he started, parallel to his work as a consultant, a PhD project at the University of Twente, Centre for Transport Studies. In 2015, he defended his thesis with the title 'Multi-objective Optimisation of Multimodal Passenger Transportation Networks'. In the thesis, the resolution of measures related to multimodal trip making and corresponding network designs are identified by solving a multi-objective optimisation problem. Methods to analyse the resulting Pareto set provide insight into how total travel time, CO2 emissions, urban space used by parking and costs relate. Between 2017 and 2021, he worked as a part-time postdoc at TU Delft working on the project studying the impact of the Noord/Zuidlijn: a metro connection in Amsterdam that started operation in the summer of 2018. It was mainly data-driven research, with aspects such as travel times, service reliability, ridership, passenger flows through the network and traveller's perception.

Serge Hoogendoorn was appointed Antonie van Leeuwenhoek professor Traffic Operations and Management in 2006. He has been the chair of the Department of Transport and Planning since 2018 and is currently (one of the four) Distinguished Professor of Smart Urban Mobility at Delft University of Technology. He has a part-time appointment at Monash University, and is an Honorary Professor in the School of Transportation at South East University in China. He holds a distinguished research fellow position at RIOH (Beijing). He is the PI Mobility in the Institute of Advanced Metropolitan Solutions (www.ams-amsterdam.com), a staff member of the TRAIL Research School on Transport and Logistics at DUT, and he chairs the Network Management foundation. He completed his PhD at Delft University of Technology in 1999. His current research revolves around Smart Urban Mobility, with focal areas such as i) theory, modelling, and simulation of traffic and transportation networks, including cars, pedestrians, cyclists and novel public transport services (e.g. Demand Responsive Transit in combination with traditional PT); ii) development of methods for integrated management of these networks (regional network management, crowd and bicycle management; public transport operations); iii) impact of uncertainty of travel behaviour and network operations; iv) impact of ICT on network flow operations, robustness and resilience; and v) urban data and their applications. In all these topics, his work considers both recurrent and non-recurrent (emergency) situations.

Fig. 3 Percentage error in prediction per link for the locally estimated model

Table 1
Model parameter comparison between models estimated on 'before' and 'after' datasets
* p < 0.01 for all estimates
** The reported estimates are after adjusting for scale differences
a Includes in-vehicle time and origin waiting times
b Includes bus-bus, tram-tram and bus-tram transfers

Table 2
Direct model elasticities for locally estimated and transferred model

Table 3
Disaggregate measures of predictive ability for locally estimated and transferred models

Table 4
Aggregate measures of predictive ability for locally estimated and transferred models

Table 5
Observed and predicted mode shares

Table 6
Predictive measures of transferability for alternate transferred model specifications