The value of additional data for public transport origin–destination matrix estimation

Passenger origin–destination data is an important input for public transport planning. In recent years, new data sources have become increasingly common through the automatic collection of entry counts, exit counts and link flows. However, collecting such data can sometimes be costly. The value of additional data collection hence has to be weighed against its costs. We study the value of additional data for estimating time-dependent origin–destination matrices, using a case study from the London Piccadilly underground line. Our focus is on how the precision of the estimated matrix increases when additional data on link flow, destination count and/or average travel distance is added, starting from origin counts only. We concentrate on the precision of the most policy-relevant estimation outputs, namely, link flows and station exit flows. Our results suggest that link flows are harder to estimate than exit flows, and only using entry and exit data is far from enough to estimate link flows with any precision. Information about the average trip distance adds greatly to the estimation precision. The marginal value of additional destination counts decreases only slowly, so a relatively large number of exit station measurement points seems warranted. Link flow data for a subset of links hardly adds to the precision, especially if other data have already been added.


Introduction
Passenger origin-destination data is an important input for public transport (PT) planning. PT demand is summarized in time-dependent origin-destination (OD) matrices, which state the number of trips between pairs of stations, i.e., the number of passengers from an origin to a destination station per time interval, such as 15-min intervals. Knowledge of such matrices may improve the efficiency of PT supply (Pelletier et al. 2011), e.g., through cost-effective timetable designs (Sun et al. 2014) or by supporting studies of passenger costs from timetable changes made to solve track capacity conflicts (Ait-Ali et al. 2020).
In recent years, many PT systems have adopted new technological solutions such as automated fare collection (AFC), automated vehicle location (AVL), and vehicle weighing systems that measure passenger link flows. These solutions generate useful data, e.g., smart card data or automatic vehicle weights, which can be used for OD matrix estimation. However, acquiring such data can sometimes be costly, since it often requires installation and maintenance of measurement equipment on stations, tracks, and vehicles. Having measurements on all stations and links can be prohibitively costly, so a PT agency needs to weigh these costs against the benefits of a more precisely estimated OD matrix.
In this study, we investigate how much the precision of an estimated dynamic OD matrix for a single train line increases when additional data becomes available. We use a case study from the London Piccadilly underground line. Starting with origin counts only, we incrementally add data about exit counts, link flows and average trip distance, and measure how the precision of the estimated matrix increases with additional data.
We concentrate on the precision of the most policy-relevant variables, namely time-dependent link flows and station arrival rates, since these determine policy decisions such as service frequency (Ait-Ali et al. 2020) and capacity of stations and trains. They are also the key variables when analyzing passenger costs and benefits when adjusting timetables to solve capacity conflicts with other trains, as explained by Ait-Ali et al. (2020).
Our results suggest that entry and exit data alone is far from enough to estimate link flows with any precision. Information about the average trip distance adds greatly to estimation precision. Moreover, extrapolating from a limited number of destination counts or link flow measurements to the rest of the network results in lower added value, especially if prior data such as average travel distance is already included. Measuring a relatively large number of link flows and exit stations thus seems warranted.
Section 2 briefly summarizes the large literature on OD estimation. Section 3 describes the methodology and the case study. Results are presented in Sect. 4, and Sect. 5 concludes the paper.

Literature review
The research literature about OD estimation is rich and has a long history that can be traced back to the early twentieth century, e.g., with gravity models (Reilly 1931), entropy maximization (Cesario 1973), and Furness methods (Morphet 1975).

Various origin-destination problems appear in many fields of transportation research (Doblas and Benitez 2005). Most studies treat the time-independent (or static) problem (Wang et al. 2012), but there has been an increasing interest in the (harder) time-dependent (or dynamic) version (Cho et al. 2009; Zúñiga et al. 2021), which is the focus of this paper. This is partly due to increased data availability through AFC data, which is valuable for more precise estimation of (dynamic) OD matrices (Gordillo 2006). Better OD estimates can be used to improve PT services in various ways, for instance by inferring the purpose of the trips (Alsger et al. 2018), pricing and allocating railway capacity (Ait-Ali et al. 2020), or by estimating in-vehicle crowding costs (Hörcher et al. 2017; Yap et al. 2018).
The OD estimation problems also differ in terms of the considered zones and the studied type of transport traffic. Some studies looked at the flow of road vehicles (Wang et al. 2012) whereas fewer considered passenger flow in PT systems (Alsger et al. 2018; Zúñiga et al. 2021) such as buses (Wang et al. 2011), freight (Shen and Aydin 2014) or passenger rail (Gordillo 2006). Similar to the study by Wong and Tong (1998), this paper focuses on the passenger flow in a commuter rail system.
The formulation of the problem also depends on whether prior (target) matrices exist. Many authors assume the existence of such a matrix (Wang and Zhang 2016). However, this is not the case in our study and many others (Cho et al. 2009).
Generally speaking, the OD estimation problem consists of finding the most probable matrix that is consistent with observations or minimizing the deviation from observations. The definition of "most probable" (and "deviation") leads to different formulations of the objective function and functional constraints, and thus to a number of OD estimation models. For instance, deviation functions can be modeled in various ways, e.g., using discrete choice models (Ben-Akiva and Lerman 1985), generalized least squares (Cascetta and Nguyen 1988), Kalman filters (Cho et al. 2009), mean least squares with entropy (Xie et al. 2011), and gravity models (Shen and Aydin 2014). Other modeling approaches also exist, such as genetic algorithms with entropy (Fu 2012), principal component analysis (Djukic et al. 2012), Bayesian inference (Carvalho 2014), trip chaining (Alsger et al. 2016; Hora et al. 2017), Markov chain models (Abareshi et al. 2019) or artificial neural networks (Zúñiga et al. 2021). Due to the continuous development of new approaches, several authors summarize and compare many of the different OD estimation models, e.g., Cascetta and Nguyen (1988), Abrahamsson (1998), Peterson (2007), Bera and Rao (2011), Deng and Cheng (2013), and more recently Alsger (2017) and Li et al. (2018).
In the absence of a target matrix, this paper adopts the entropy maximization (EM) principle, which is also equivalent to several models such as gravity (Wilson 1967), minimum information (Van Zuylen and Willumsen 1980) and discrete choice models (Mishra et al. 2013). The EM principle originates from the statistical theory of probability. In the context of OD estimation, the EM principle relies on the idea that there are many possible trip distributions (or system states) and that the most probable OD estimate (or state) is the one that maximizes the total entropy (or randomness). Variants of such a formulation have been adopted in many OD estimation studies. Fisk (1988) used a similar (time-independent) formulation and considered that the choice of the path depends on the total travel time (or congestion). Similarly, Brenninger-Göthe et al. (1989) used it in a multi-objective program for OD estimation using traffic counts. The same formulation was also more recently used by Xie et al. (2011) and Fu (2012).
Different types of data have been used to estimate OD matrices, e.g., cell phone network, tolling (Wang and Zhang 2016), GPS data and travel surveys (Ge and Fukuda 2016). In PT systems, the increasing adoption of AFC and AVL, and thus the availability of the corresponding smart card data, has led to the emergence of new applications based on such data (Nassir et al. 2011; Alsger et al. 2015), including historical and/or real-time data (Zúñiga et al. 2021).
Such studies focus on different aspects to improve PT planning and its efficiency. For instance, some research is about the estimation (Mosallanejad et al. 2019) or the validation (Alsger et al. 2016) of OD matrices in different PT systems, or more particularly about the problem of trip destinations (Trépanier et al. 2007). Moreover, different case studies exist from various PT networks around the world. These include entry-only and/or entry-exit systems from New York (Barry et al. 2002), Santiago (Munizaga and Palma 2012), China (Chen and Liu 2016) and London (Wang et al. 2011). London is also the case study in this paper. Readers interested in a summary of the different OD estimation studies using smart card PT data are referred to the review paper by Li et al. (2018). Wang et al. (2012), one of the only studies on the value of data, looked at the additional value of well-located sensors for improving road traffic OD estimates. However, although rich, the research literature on PT data does not include, to our knowledge, studies that look at the value of knowing such additional data for dynamic OD estimation. Hence, the purpose of this paper is to fill this gap in the literature by studying the value of smart cards and additional PT data.

Methodology and data
In this section, we describe and formulate the main problem, i.e., the OD estimation, the solution method (details in the appendix) and the case study.
Let $n_{ij}^t$ be the number of passengers starting from station $i$ in time interval $t$ and going to station $j$. The (dynamic) OD matrix estimation consists of finding a time-dependent origin-destination matrix $n_{ij}^t$ that is consistent with observations. This is done by estimating the entropy-maximizing matrix (the "most probable" matrix) that is consistent with observations of origin counts $O_i^t$, destination counts $D_j^t$, link flows $F_l^t$ and the average trip distance $\bar{d}$.
In the following, we assume that origin counts are always available since many PT systems collect such data at entry gates. Destination counts, however, are not always collected, since equipping exit gates with data collection equipment is costly. Link flow measurements require specialized equipment, such as automated vehicle weighing. The average trip distance is usually estimated using travel surveys.
From the network and timetable, the travel time matrix $\tau_{ij}$ can be calculated. Given these, the estimated number of arriving passengers at station $j$ in time interval $t$ can be calculated as $\sum_i n_{ij}^{t-\tau_{ij}}$. The estimated flow on link $l = (i, j)$ at time $t$ can be calculated analogously, by summing $n_{se}^{t-\tau_{si}}$ over the OD pairs $(s, e)$ whose path traverses link $l$.
The core question of the paper is how the precision of the estimated matrix improves when more and more data become available. Let $L$ be the subset of links where link flows are known, and $\Delta$ the subset of stations where exit counts are known. Similar to the EM model by Xie et al. (2010) and Fu (2012), the studied dynamic OD estimation problem is formulated in Eq. (1).
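As a reading aid, the entropy-maximization program described above can be sketched as follows. This is a reconstruction under stated assumptions rather than the authors' Eq. (1): the symbols $\tau_{ij}$ (travel time in intervals), $\bar{d}$ (average trip distance), $d_{ij}$ (station-to-station distance) and the set $P_l$ of OD pairs $(s, e)$ whose path traverses link $l = (i, j)$ are notation assumed here, and the objective is written as a maximization (minimizing its negative is equivalent).

```latex
\begin{align}
\max_{n \ge 0}\quad
  & -\sum_{t}\sum_{i}\sum_{j} n_{ij}^{t}\left(\ln n_{ij}^{t} - 1\right) \\
\text{s.t.}\quad
  & \sum_{j} n_{ij}^{t} = O_i^{t}
      && \forall i,\ \forall t \\
  & \sum_{i} n_{ij}^{t-\tau_{ij}} = D_j^{t}
      && \forall j \in \Delta,\ \forall t \\
  & \sum_{(s,e) \in P_l} n_{se}^{t-\tau_{si}} = F_l^{t}
      && \forall l = (i,j) \in L,\ \forall t \\
  & \sum_{t}\sum_{i}\sum_{j} d_{ij}\, n_{ij}^{t}
      = \bar{d} \sum_{t}\sum_{i} O_i^{t}
\end{align}
```

Enlarging $\Delta$ or $L$ adds constraints of the second and third kind, which is how the incremental data scenarios enter the problem.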
The central question can now be stated as: By how much is the precision of the estimated OD matrix $n_{ij}^t$ improved when additional data becomes available, i.e., when the sets $\Delta$ and $L$ become larger?
We must thus define what kind of "precision" we are interested in. In applied policymaking, e.g., timetable design and investments in links or stations, the exact cells of the OD matrix are less important. What matters most are station flows and link flows, since these determine the crowding levels in vehicles and stations. This is used for decisions about link and station capacity upgrades, station staff planning, and timetable design (timetable optimization depends mainly on passenger departure and arrival rates per line segment, and on crowding levels on different links). Hence, we will concentrate on how close to reality the estimated OD matrix is in terms of link flows and arrival rates per station (origin rates are assumed to be known). We thus measure the relative root mean square error (or deviation) for link flows ($\mathrm{RMSE}_{\mathrm{link}}$) and arrival rates at destination stations ($\mathrm{RMSE}_{\mathrm{dest}}$), and study how these vary with more available information, such as when the sets of available link flows and destination counts, $L$ and $\Delta$, become larger.
Let $\hat{D}_j^t$ and $\hat{F}_l^t$ be the estimated number of passengers arriving at station $j$ in time interval $t$, and the estimated link flow on link $l$ in time interval $t$. The relative errors are then defined as in Eqs. (2) and (3). For those stations and links where data is available ($j \in \Delta$ and $l \in L$) the errors will of course be zero (assuming that the optimization problem is feasible). The errors hence measure the deviations for the unobserved stations and links, in other words, how well the available station and link data can be extrapolated to unobserved stations and links. Table 2 lists the different combinations of destination counts, link flows and average travel distance that we will explore.
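The error measure described above can be illustrated with a minimal sketch. The exact normalization in Eqs. (2) and (3) is not reproduced in this text, so the sketch assumes that "relative" means the RMSE over the unobserved entries divided by their mean observed value; the function name `relative_rmse` and the `known` argument are hypothetical.

```python
import math

def relative_rmse(estimated, observed, known=()):
    """Relative RMSE over the entries whose index is NOT in `known`.

    Assumption: the RMSE is normalized by the mean observed value of the
    evaluated entries; the paper's exact normalization may differ.
    """
    known_set = set(known)
    idx = [k for k in range(len(observed)) if k not in known_set]
    if not idx:
        return 0.0  # every entry is observed, nothing is extrapolated
    mse = sum((estimated[k] - observed[k]) ** 2 for k in idx) / len(idx)
    mean_obs = sum(observed[k] for k in idx) / len(idx)
    return math.sqrt(mse) / mean_obs
```

Entries listed in `known` play the role of stations in $\Delta$ (or links in $L$): they are reproduced exactly by a feasible estimate, so only the remaining entries contribute to the error.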

Solution method
The EM estimation model is a convex (nonlinear) minimization program with a nonlinear objective function (the total entropy) and linear constraints. Finding a solution for time-dependent real-world instances (e.g., large networks and/or longer periods) is generally hard. Thus, instead of using state-of-the-art solvers, we derive the iterative solution methods using Lagrangian relaxation.
We first relax the constraints and associate corresponding Lagrangian multipliers as presented in Table 1. This leads to the formulation of a Lagrange function (or relaxed dual objective function). More details can be found in the Appendix.
Using first-order optimality conditions on the Lagrange function, we can formulate the (primal) solution, i.e., OD estimate as a function of the (dual) Lagrangian multipliers. Depending on the studied data, we find different solution formulations of the dynamic OD estimate n t ij . Table 2 presents the formulations for the different studied variants. A more detailed derivation of these solution formulations is described in the appendix.
To estimate the multipliers, we use the problem constraints. In some trivial cases, it is possible to find a closed-form expression, such as in the basic O model, where all destinations have a similar attractivity. In other (more interesting) cases, this is often difficult (sometimes impossible). Thus, we attempt to find numerical solutions by iteratively balancing the relaxed constraints corresponding to additional studied data. Figure 1 is an example of an iterative algorithm to find the numerical solution of the multipliers for the O-d model. More details about the iterative algorithms can be found in the Appendix.
The iterative solution algorithm stops when the constraints are satisfied, up to a certain tolerance $\varepsilon$. Note that the use of the (hard) constraint of origin counts (from smart cards) to derive an analytic expression of the dynamic OD estimate yields the formulation in (4). The term $p(j|i, t)$ can be seen as the probability of choosing destination $j$ when departing from origin $i$ at time interval $t$. In this case, the exponent $u_{ij}^t$ can be interpreted as the total utility for traveling to $j$ from $i$ during time interval $t$. Such utility may include parameters $k^{(1)}_{ijt}, \dots, k^{(m)}_{ijt}$ corresponding to $m$ types of additional data, if available. The coefficients (or multipliers) $\beta_1, \dots, \beta_m$ are estimated to reflect utilities (if $\beta_k \ge 0$) or disutilities (if not). The constant $K_j^t$ is specific to the destination station $j$ and time interval $t$.
Such an interpretation can also be found in discrete choice models (without the random error term) where the discrete choices are between the different destination stations $j$ given an origin $i$. The (alternative-specific) constants $K_j^t$ and parameters $\beta_1, \dots, \beta_m$ are specific to the PT system where the OD estimation is performed. They need to be estimated to reflect the (dis-)utilities explaining the choice of the passengers. It is possible to estimate the values of these parameters using additional data, e.g., from smart cards, stated (or revealed) preference surveys, old OD or target matrices from the same PT system.
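The logit-like form described above can be made concrete with a small sketch of the O-d variant for a single time interval. It is an illustration under assumptions, not the algorithm of Fig. 1: the utility is taken to be `beta * d_ij` only, the coefficient is found by bisection rather than by the authors' balancing scheme, and the names `od_from_beta`, `mean_distance` and `fit_beta` are hypothetical.

```python
import math

def od_from_beta(origins, dist, beta):
    """n[i][j] = O_i * exp(beta * d_ij) / sum_j' exp(beta * d_ij'), j != i."""
    n = []
    for i, o_i in enumerate(origins):
        # Weight of each destination; trips to the origin itself are excluded.
        w = [0.0 if j == i else math.exp(beta * d) for j, d in enumerate(dist[i])]
        total = sum(w)
        n.append([o_i * w_j / total for w_j in w])
    return n

def mean_distance(n, dist):
    """Average trip distance implied by the OD matrix n."""
    size = len(n)
    trips = sum(sum(row) for row in n)
    return sum(n[i][j] * dist[i][j] for i in range(size) for j in range(size)) / trips

def fit_beta(origins, dist, d_bar, lo=-5.0, hi=5.0, tol=1e-9):
    """Bisection on beta: the implied mean distance is non-decreasing in beta."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_distance(od_from_beta(origins, dist, mid), dist) < d_bar:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy line with three stations at 0, 2 and 5 km (made-up numbers).
dist = [[0.0, 2.0, 5.0], [2.0, 0.0, 3.0], [5.0, 3.0, 0.0]]
origins = [100.0, 50.0, 80.0]
beta = fit_beta(origins, dist, d_bar=3.0)
n = od_from_beta(origins, dist, beta)
```

In this toy instance the target average distance is shorter than what uniform destination choice would give, so the fitted coefficient comes out negative, i.e., distance acts as a disutility. The origin constraint holds by construction because each row is a probability vector scaled by the origin count.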

Case study data
To explore the question formulated above, we use a case study based on the London Piccadilly underground line. Transport for London (or TfL) provides open access to a comprehensive multi-rail demand dataset as part of the NUMBAT project (TfL 2018). Based on the use of smart cards at entry/exit station gates during a typical 2018 autumn weekday, the dataset provides information about the number of passengers boarding and alighting at each station (per 15 min), and link flows (per 15 min) for a subset of the links (data for around 100 links are available, but we study 12 of the most crowded links). The data also contains an estimated OD matrix for longer time periods which is used in this case study to calculate the average travel distance.
The Piccadilly line (Fig. 2) is more than 70 km long and consists of 53 stations with two different western branches at Acton Town station. Note the one-way trajectory around Heathrow airport from/to Hatton Cross through Terminal 4, then Terminals 2 and 3.
In Fig. 3, stations are sorted according to their location on the studied line to make it easy to visualize the symmetry of the distance matrix. However, the matrix, as shown in the figure, is not completely symmetric: the entries around the airport differ because of the previously mentioned one-way trajectory.
To calculate the average travel times $\tau_{ij}$, we use the travel distances between each pair of stations, which are illustrated in Fig. 3. We assume that all trains are running according to the train timetable (headways) presented in Table 3, and that their average speed is 33 km/h (TfL 2018).
As presented in Table 3, we focus in this study on three different time periods, i.e., morning and afternoon (peak) as well as midday (off-peak). These periods are also illustrated in Fig. 4, which also shows the variation of both the origin (boarders) and the destination or exit (alighters) counts per 15-min time interval over the day. The studied time periods are separated by dashed vertical lines in the figure.
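The travel-time calculation described above (distance divided by an average speed of 33 km/h) and the way travel times shift departures into arrivals can be sketched as follows. The rounding to whole 15-min intervals, the neglect of waiting time, and the function names `travel_time_intervals` and `arrivals` are assumptions made for illustration; the OD array is indexed as `n[t][i][j]`.

```python
def travel_time_intervals(dist_km, speed_kmh=33.0, interval_min=15.0):
    """Convert a distance matrix (km) into travel times in whole intervals.

    Assumption: in-vehicle time only, rounded to the nearest interval.
    """
    return [[round(d / speed_kmh * 60.0 / interval_min) for d in row]
            for row in dist_km]

def arrivals(n, tau, j, t):
    """Estimated arrivals at station j in interval t.

    Sums n[t - tau[i][j]][i][j] over origins i, skipping departures that
    would fall outside the studied period.
    """
    total = 0.0
    for i in range(len(tau)):
        t_dep = t - tau[i][j]
        if i != j and 0 <= t_dep < len(n):
            total += n[t_dep][i][j]
    return total
```

For example, two stations 16.5 km apart are two intervals apart at 33 km/h, so passengers counted as arriving in interval 2 are those who departed in interval 0.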
In addition to the temporal variation (per time interval) of the number of boarders and alighters as shown in Fig. 4, we present the spatial variation (per station) in Fig. 5 over the day. The stations on the horizontal axis are sorted by the number of alighters (from highest). Figure 6 presents the link flows during the day for three of the largest links.
The average travel distance per passenger $\bar{d}$ is usually estimated from demand travel surveys. For our case study, we calculate it based on the available OD matrices (per time period). Table 4 shows the average travel distance in km per passenger for the different studied time periods of the day.

Results
In this section, we present results on how the precision of the estimated matrix varies when more data is included in the estimation. We focus on the precision of arrival rates at destination stations and link flows. Several scenarios with different types of additional data are tested. Table 5 presents an overview of the reported results, i.e., tested models and the corresponding presented estimation errors.
Two types of data are incrementally added, i.e., destination and link data. The value of such additional data is studied by testing different estimation models. The average travel distance is also studied in certain models. Note that when incrementally including link flows in O-D-F and O-d-D-F, exit counts (destination data) from all stations are considered, unlike other models (i.e., O-D and O-d-D variants) where these are also incrementally included.

Estimating arrival rates per station
We first focus on the destination estimation, i.e., the number of alighters in the system per 15-min interval and station. Figure 7 shows how the relative error ($\mathrm{RMSE}_{\mathrm{dest}}$) varies when data for more and more stations is added. Stations are sorted according to their total number of alighters, and for each step along the x-axis, data for one additional station is added. The $\mathrm{RMSE}_{\mathrm{dest}}$ error is presented separately for three parts of the day (morning, midday and afternoon). When data for all destinations has been added, the error is of course zero.
Surprisingly, including a relatively small number of destinations increases the error for both the midday and afternoon periods. Only after a certain amount of destination data has been added does the error decrease. Adding the average travel distance data in the estimation further reduces the relative error (up to 50%). For the midday and afternoon periods, just adding the average travel distance decreases the error by as much as adding data for almost all destinations but without the average distance.
These results suggest that having data only for a subset of destinations is sometimes not enough; in fact, it may even increase the overall error, e.g., in the midday and afternoon periods. Having enough exit counts seems to be important to get better estimates. However, systems for collecting such larger amounts of data are often expensive to design, operate and maintain. This underlines the importance of comparing the data collection costs and value for alternative types of data, e.g., average trip distances, which seem to be highly valuable here.
With a focus on the O-F model, Fig. 8 shows how $\mathrm{RMSE}_{\mathrm{dest}}$ varies when incrementally adding link flow data from 12 of the most crowded links. The error is presented for the three time periods of the day, and links are sorted and added according to their passenger flows. The order of added links may differ between the different periods; see Fig. 10 later for the specific added links.
Including link flow data does not always seem to reduce the destination error: the error increases during the midday period, whereas it remains almost constant in the afternoon. The exception is the morning period, where the error is slightly reduced when more links are added.
The results indicate that detailed (often expensive) data such as link flows are not always valuable. It is therefore important to investigate (cheaper) complementary data types that can improve the quality of the estimates. Later in Fig. 10, we study the combination with other additional data, i.e., average travel distance and exit counts.
Unlike Fig. 7, which presents the deviation errors of the (more aggregate) exit counts, Fig. 9 shows the error for the (more detailed) link flows between the studied OD pairs. Thus, the errors are higher in Fig. 9.
Even after all exit counts have been included, the link flow error is far from zero. In fact, adding destination data hardly improves the link flow estimation, apart from the first few destinations. As above, information about the average trip distance greatly decreases the error. Surprisingly, however, adding more destination data tends to increase the error in this case. Therefore, better (and more economical) estimates of link flows can be reached by combining different types of data and by using the right (amount of) data points.
To get decent precision in the link flow estimation, link flow data seems to be necessary. Figure 10 shows the change in $\mathrm{RMSE}_{\mathrm{link}}$ when incrementally including flow data for 12 of the most crowded links (as in Fig. 8). The figures show estimations without destination data (O-F), with all destination data (O-D-F), and with destination data and average travel distance (O-d-D-F).
In the O-F model, the error decreases with more data, as expected. However, although the relative error is lower (than O-F), in models with destination data (O-D-F) and average travel distance (O-d-D-F), it remains almost constant when link flow data is added. This is the case for all the periods except for the morning peak hours where the relative error decreases after adding the 4th link but remains constant after.
These results indicate that the marginal value of additional link flow data is high only when data such as exit counts and average travel distance are absent. If such data exists and is used in the estimation, the value of link flow data becomes almost insignificant. It is therefore important to compare the cost of collecting the additional data with its marginal value. Including more link data can provide further insights; however, the estimation models tend to become more computationally expensive.
Fig. 9 Variation of $\mathrm{RMSE}_{\mathrm{link}}$ as destination data is incrementally added

Conclusions and future works
Even if the literature includes several studies on (dynamic) OD-matrix estimation, this work attempts to assess the marginal value of additional data in terms of estimation precision. The additional data we have studied is arrival rates per station (which may be collected through AFC systems or specialized equipment), link flows (which may be collected by a vehicle weighing system) and average trip distances (from travel surveys). We explore this through a case study based on the London Piccadilly line in 2018, separating three time periods of the day (i.e., morning and afternoon peak hours, midday). We focus on the precision of the estimated time-dependent arrival rates and link flows, rather than on individual cells in the time-dependent OD matrix.
The results indicate that arrival rates per destination station (if available for enough stations) may improve the estimation, but in two cases of three, including data for a subset of destinations made the estimation worse. Perhaps contrary to expectations, it turns out to be valuable to have data for a very large share of destinations: the marginal value of acquiring more data, even for the last stations, is surprisingly high. Arrival rates (exit counts) can be collected easily for AFC systems which are based on "tap-in/tap-out" (such as London), but for entry-only AFC systems (such as Stockholm), special data collection equipment needs to be installed at exit gates. Our results suggest that installing such equipment may only lead to marginal improvements of the estimated OD-data unless a large (enough) share of stations is equipped. If such equipment is costly, it might be more cost-efficient to consider other forms of data collection, and to study the value of the collected data.
Similarly, the study of a subset of added link flows indicates that link estimation may improve but only if no prior additional data is already added. Otherwise, the estimates are better (than with no prior data) but do not improve with added link data. Thus, the marginal value of such detailed data may be insignificant if specific prior data is already included. These results show that detailed (often expensive) data may have a lower marginal value for the demand estimation and can therefore lead to less accurate demand-sensitive policy decisions, e.g., setting welfare-optimal line frequencies.
Based on a case study, the paper highlights that collecting additional, more detailed data (often more expensive) does not always lead to more accurate estimates, i.e., its marginal value can be low. Thus, the results emphasize the importance of considering both the costs of collecting such additional data and its marginal value.
There are a number of possible future works that can further validate these results, e.g., using other estimation models, metrics for the valuation of the estimation quality, and by studying additional data sources in other case studies. For instance, we used the relative RMSE to quantify the precision of these estimates, but other metrics can be tested in future work, such as the implied optimal service frequencies (Ait-Ali et al. 2020), or levels of in-vehicle crowding (Çelebi and İmre 2020). Full-day estimation instead of per period can also be tested when additional data is lacking. However, assuming that the time-aggregated OD matrix is symmetric is a strong assumption, and is for example violated in our data set. Furthermore, such full-day estimation also requires additional computational power and can be intractable for large networks.
Overall, information about average trip distances gives by far the greatest improvement of the estimation. Acquiring such estimates, from travel surveys, link flow measurements or other means is hence a priority. In this study, we have only used one average distance (per time period) for the whole line, but obviously, getting more detailed data (for parts of the line) would be highly valuable. Furthermore, instead of gradually including data based on the magnitude of the counts or flow, other orders can also be tested, e.g., based on job or home locations during peak hours. Closely related, the model can be adapted to find more valuable data collection strategies, e.g., data types and their spatiotemporal locations.

Appendix 1: Solution formulation
The Lagrangian relaxation of the different constraints with the corresponding multipliers leads to a Lagrange function $\mathcal{L}$. Applying the first-order optimality condition for the function $\mathcal{L}$ in terms of the variable $n_{ij}^t$ gives the OD estimate as a function of the multipliers. Note that the multiplier $\mu_{j,t+\tau_{ij}}$ is only included if $j \in \Delta$, i.e., known data on the exit counts at station $j$. Similarly, $\lambda_{lt}$ is also only included if $l = (i, j) \in L$, i.e., known flow at link $l$. Thus, we have the following general solution formulation
$$n_{ij}^t = e^{\lambda_{lt}}\, e^{\alpha_{it} + \beta d_{ij} + \mu_{j,t+\tau_{ij}}}$$
To sum up, depending on the studied additional data, we have different variants of the solution formulation as presented in Table 2.
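For the variant without link-flow data, the derivation outlined above can be sketched as follows. This is a reconstruction under assumptions, not the authors' displayed equations: the multiplier names $\alpha_{it}$ (origin counts), $\mu_{jt}$ (destination counts) and $\beta$ (average distance), the travel-time symbol $\tau_{ij}$, and the sign conventions are chosen here so that the result matches the exponential form of the general solution.

```latex
\begin{align}
\mathcal{L} &= -\sum_{t,i,j} n_{ij}^{t}\left(\ln n_{ij}^{t} - 1\right)
   + \sum_{i,t} \alpha_{it}\Big(\sum_{j} n_{ij}^{t} - O_i^{t}\Big) \notag \\
 &\quad + \sum_{j \in \Delta,\, t} \mu_{jt}\Big(\sum_{i} n_{ij}^{t-\tau_{ij}} - D_j^{t}\Big)
   + \beta\Big(\sum_{t,i,j} d_{ij}\, n_{ij}^{t} - \bar{d}\sum_{t,i} O_i^{t}\Big) \\
\frac{\partial \mathcal{L}}{\partial n_{ij}^{t}}
 &= -\ln n_{ij}^{t} + \alpha_{it} + \beta d_{ij} + \mu_{j,t+\tau_{ij}} = 0 \\
\Rightarrow\quad n_{ij}^{t}
 &= \exp\!\left(\alpha_{it} + \beta d_{ij} + \mu_{j,t+\tau_{ij}}\right)
\end{align}
```

The time shift in $\mu_{j,t+\tau_{ij}}$ arises because passengers departing $i$ in interval $t$ are counted at the exit of $j$ in interval $t+\tau_{ij}$. Link-flow constraints would contribute a further multiplicative factor in the same way.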

Appendix 2: Iterative algorithm
The iterative algorithm aims at estimating the multipliers. For that, each iteration of the algorithm attempts to balance the different constraints until these are satisfied (up to a certain error tolerance $\varepsilon$). To derive the algorithms, we first use the (hard) constraints for the origin counts ($O_i^t \neq 0$) to estimate $\alpha_{it}$. With smart card data on counts ($D_j^t \neq 0$) at large destination stations, we use the corresponding constraints to estimate $\mu_{jt}$. Similarly, the constraints for additional data on the average travel distance $\bar{d}$ can be used to estimate $\beta$ by finding the solution (root) of the corresponding equation. When we include additional data on flows ($F_{l=(i,j)}^t \neq 0$) at crowded links, we estimate $\lambda_{lt}$ by solving the following system of linear (in $e^{\lambda_{lt}}$) equations
$$e^{\lambda_{lt}}\, e^{\alpha_{it} + \beta d_{ij} + \mu_{j,t+\tau_{ij}}} + \sum_{\substack{s < l,\; e > l \\ l^* = (s,e) \in L \\ l^* \neq l}} e^{\lambda_{l^*,t-\tau_{si}}}\, e^{\alpha_{s,t-\tau_{si}} + \beta d_{se} + \mu_{e,t+\tau_{se}-\tau_{si}}} = F_l^t$$
The iterative algorithm stops either after a certain number of iterations or when all the constraints are satisfied, e.g., using RMSE and an error tolerance $\varepsilon$.
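For the variant without link-flow data, the balancing steps described in this appendix can be written out as follows. These expressions are derived here by substituting the exponential solution form into each constraint; the multiplier names $\alpha_{it}$, $\mu_{jt}$, $\beta$ and the travel-time symbol $\tau_{ij}$ are assumed notation, and the authors' exact update rules may be arranged differently.

```latex
\begin{align}
e^{\alpha_{it}} &= \frac{O_i^{t}}{\sum_{j} \exp\!\left(\beta d_{ij} + \mu_{j,t+\tau_{ij}}\right)}
   && \forall i,\ \forall t \\
e^{\mu_{jt}} &= \frac{D_j^{t}}{\sum_{i} \exp\!\left(\alpha_{i,t-\tau_{ij}} + \beta d_{ij}\right)}
   && \forall j \in \Delta,\ \forall t \\
g(\beta) &= \sum_{t,i,j} d_{ij}\, n_{ij}^{t}(\beta) - \bar{d} \sum_{t,i} O_i^{t} = 0
\end{align}
```

One pass of the algorithm would apply the first two updates in turn and then solve $g(\beta) = 0$ numerically, repeating until the largest constraint violation falls below the tolerance.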