Within the process of building a travel demand model, calibration and validation are two iterative steps. Calibration describes the process of modifying the model: refining specifications, correcting (previously undiscovered) erroneous input data or adjusting (non-empirically verified) parameters. Validation, on the other hand, describes the process of testing whether the model has reached a certain goodness-of-fit. This goodness-of-fit does not refer to a particular key performance indicator. Ideally, before the model building process starts, the tendering institution determines how accurately the model should reproduce e.g. number of trips, traffic volumes, mean trip distances and mean trip times for each mode and trip purpose, keeping in mind the availability of reference data and its quality. If the validation process concludes that the model quality is not sufficient, the model is adjusted in a further calibration step.

Frequency distributions, which are another type of key performance indicator, describe travel behaviour using specific indicators in discrete classes. There are two common cases: frequency distributions of (1) the time travelled, also known as trip time distributions or trip time frequency distributions, and (2) the distance travelled, commonly known as trip distance distributions or trip length frequency distributions.

Among others, those distributions are an important tool to evaluate how well a model fits with observed sample data. Therefore, the comparison of distributions is an essential part in the model validation process. Despite its importance the common modelling guidelines from the UK (WebTAG (Department for Transport 2014)), the USA (Travel Model Validation and Reasonableness Checking Manual (Cambridge Systematics Inc. 2014)) or Austria (draft of Qualivermo (Sammer et al. 2012)) provide little information about the correct structure and handling of such distributions. Likewise, common statistical methods that check whether two samples derive from a common population are not practicable for application during the validation of transport models, as will be explained later. This lack of rules leads to individual solutions, which complicate the model validation process, the comparison of models and the definition of thresholds for quality indicators in guidelines.

Fig. 1 illustrates this issue. The same data set (number of trips with their corresponding trip length from a household travel survey and from a travel demand model) is classified into discrete classes in three different ways: the left figure uses 20 equidistant classes (1.5 km each), the middle figure 10 equidistant classes (3 km each) and the right figure 5 equidistant classes (6 km each). The overlap of the observed and modelled distribution, which is measured by the Coincidence Ratio (CR, which will be explained in more detail later), shows that the lower the number of classes, the greater the overlap. In an extreme case with only one or two classes, there would be a very high overlap. In this example, the classes would only have to be large enough to reach the threshold of CR ≥ 0.7 proposed in the Travel Model Validation and Reasonableness Checking Manual (Cambridge Systematics Inc. 2014). Consequently, the modelled distribution in the left figure would probably be rejected, while the others would be acceptable, although the underlying data set is the same in all three figures.

Fig. 1 Classifying the same data set with different classification rules leads to different evaluation results

In general, this means that the resulting distribution evaluation is meaningless if there are no rules for creating the distribution, because even a poor model can be twisted so that the results look fine. Thus the overall goodness-of-fit is at risk. Therefore, guidelines for model validation need to suggest an appropriate way to build the distributions.
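The effect can be reproduced with a minimal sketch. The trip lengths below are synthetic and purely hypothetical (they are not the data behind Fig. 1), and the helper names `relative_frequencies` and `coincidence_ratio` are chosen for illustration only. The Coincidence Ratio is computed as the sum of the class-wise minima divided by the sum of the class-wise maxima of the relative frequencies.

```python
import random

def relative_frequencies(values, class_width, n_classes):
    """Share of trips per equidistant class; values beyond the last class are put into it."""
    counts = [0] * n_classes
    for v in values:
        idx = min(int(v // class_width), n_classes - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]

def coincidence_ratio(p, q):
    """Sum of class-wise minima divided by sum of class-wise maxima."""
    return sum(map(min, p, q)) / sum(map(max, p, q))

# Hypothetical trip lengths in km, capped below 30 km (synthetic, not survey data).
rng = random.Random(42)
observed = [min(rng.expovariate(1 / 6.0), 29.9) for _ in range(2000)]
modelled = [min(rng.expovariate(1 / 7.5), 29.9) for _ in range(2000)]

# Same data, four classifications: 20, 10, 5 classes and a single class.
cr_by_classes = {}
for n_classes, width in [(20, 1.5), (10, 3.0), (5, 6.0), (1, 30.0)]:
    p = relative_frequencies(observed, width, n_classes)
    q = relative_frequencies(modelled, width, n_classes)
    cr_by_classes[n_classes] = coincidence_ratio(p, q)
    print(n_classes, "classes: CR =", round(cr_by_classes[n_classes], 3))
```

Because the coarser classes in this sketch are unions of the finer ones, merging classes can only keep or increase the sum of minima and keep or decrease the sum of maxima, so CR cannot decrease as classes are merged and reaches 1 for a single class.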

This is one of the reasons for many misunderstandings between the model users (tendering institutions) and the model developers (contractors) in the context of model development. When validating a model with distributions of time and distance the following questions inevitably appear and they should be answered beforehand by the tendering institutions:

  • What indicator should be used for the classification?

  • Which area is covered by the distribution?

  • What is the reference for a comparison of distributions?

  • How many classes should be distinguished? What is the appropriate size of each class? Should the class size increase with the distance travelled? How to proceed with empty classes?

  • What quality indicator should be used when comparing two distributions? Should absolute or relative frequencies or both be evaluated?

The questions above are also related to the issue that the classification often depends on the dimensions of the study area. This is another reason for individual and not comparable solutions. In order to provide general guidance on how to handle frequency distributions within the model validation process, this paper proposes a standardized classification method that can be applied independently of the indicator used and of the dimensions of the study area. For the data set in the example in Fig. 1, this would lead to a single generic classification. It is therefore a necessary complement to the existing guidelines in order to make model validation results easier to evaluate and compare.

Selection of indicators for a classification

In a trip distance or a trip time distribution, the trips are assigned to discrete classes. This requires the number of trips as well as distance or time values on the level of origin–destination-pairs (OD-pairs).

Figure 2 shows a simple example with two OD-pairs (from A to B and from B to C) and two modes (car and public transport - PuT). The modes require different amounts of time to cover the distance between the OD-pairs. This can lead to cases where the trips of one OD-pair fall in different classes for different modes. Such cases occur frequently for trip time, but they can also occur for trip distance, as public transport trips are often longer than car trips. In the example of Fig. 2 the lower left chart shows that the travel demand for each OD-pair is assigned to different trip time classes, because a mode-specific indicator was used for classification.

Fig. 2 The classification by a mode-specific indicator causes the demand of one OD-pair to be assigned to different classes (left chart), whereas with a mode-independent indicator the demand of one OD-pair is assigned to the same class (right chart)

This effect is difficult to understand in more complex multi-modal trip distributions. In order to avoid this effect entirely and for comparisons on OD-pair level (e.g. a distance-dependent modal share), the travel demand of all modes on one OD-pair should fall into the same class.

To achieve this, a mode-independent indicator should be used for the classification. In case of trip distance distributions, the direct distance between an OD-pair can serve as a “natural” mode-independent indicator. For trip time distributions, such a “natural” mode-independent indicator does not exist. An indicator weighted with the number of trips provides a solution for this.

In the example of Fig. 2, the lower right chart displays a trip time distribution, which uses a mode-independent indicator for classification. As a result, both car and PuT travel demand fall in the same trip time class for each OD-pair.
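One way to obtain such a trip-weighted, mode-independent trip time is sketched below. The function name `mode_independent_time`, the dictionary layout and all numbers are hypothetical and only loosely follow the two OD-pairs of Fig. 2. It is assumed here that the weighting uses the number of trips per mode on each OD-pair.

```python
from collections import defaultdict

def mode_independent_time(trips, times):
    """Trip-weighted mean trip time per OD-pair across all modes.

    trips, times: dicts keyed by (origin, destination, mode).
    """
    weighted_sum = defaultdict(float)
    demand = defaultdict(float)
    for (o, d, mode), n in trips.items():
        weighted_sum[(o, d)] += n * times[(o, d, mode)]
        demand[(o, d)] += n
    # OD-pairs without demand get no indicator value.
    return {od: weighted_sum[od] / demand[od] for od in demand if demand[od] > 0}

# Hypothetical demand (trips) and trip times (minutes) for two OD-pairs and two modes.
trips = {("A", "B", "car"): 60, ("A", "B", "PuT"): 40,
         ("B", "C", "car"): 30, ("B", "C", "PuT"): 70}
times = {("A", "B", "car"): 10.0, ("A", "B", "PuT"): 20.0,
         ("B", "C", "car"): 25.0, ("B", "C", "PuT"): 35.0}

od_time = mode_independent_time(trips, times)
print(od_time)
```

With this indicator, car and PuT demand of an OD-pair share one value and therefore always fall into the same trip time class.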

Table 1 summarizes the options for selecting a frequency distribution indicator.

Table 1 Classification indicators for trip distance and trip time distributions

A standardized classification requires rules for handling intrazonal trips. In the case of a macroscopic travel demand matrix the main diagonal represents the intrazonal trips. Since this part of the demand is not assigned to the network model, indicators like trip time or trip distance cannot be calculated, but must be estimated in such a way that they represent an average movement within a zone. Those mean indicators will never fit to an observed intrazonal trip. Therefore, when using distribution comparisons for model validation it is advisable to exclude the intrazonal trips.

Nevertheless, as Bhatta and Larsen (2011) point out, omitting intrazonal trips leads to different results in model estimation. Therefore, the model developers are obliged to check the intrazonal trips in a separate analysis. Such an analysis should investigate the number and the modal split of intrazonal trips. A separate examination for different zone sizes may also be reasonable.

Selection of the study area and a reference distribution

A travel demand model computes demand for a defined study area. The model is calibrated with observed data from a household travel survey. This survey ideally covers a sample of the population from the entire study area. If the study area of the survey and the study area of the travel demand model do not match completely, the comparison of observed and modelled values should only include trips starting and ending in the common study area. This also means that observed trips leaving the common study area have to be excluded from the survey when comparing the observed and the modelled distributions. A comparison of two distributions is only meaningful if observed and modelled trips relate to the same area. A separate analysis is recommended for modelled trips where no comparison data is available.
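The restriction to the common study area amounts to a simple filter. The following sketch assumes that each trip record carries an origin zone and a destination zone; the function name `split_by_common_area`, the record layout and the zone numbers are hypothetical.

```python
def split_by_common_area(trips, survey_zones, model_zones):
    """Split trips into those inside the common study area and all others."""
    common = set(survey_zones) & set(model_zones)
    inside, outside = [], []
    for trip in trips:
        if trip["origin"] in common and trip["destination"] in common:
            inside.append(trip)   # usable for the distribution comparison
        else:
            outside.append(trip)  # to be analysed separately
    return inside, outside

# Hypothetical zone sets and trip records.
survey_zones = {1, 2, 3, 4}
model_zones = {2, 3, 4, 5}
trips = [
    {"origin": 2, "destination": 3},
    {"origin": 1, "destination": 2},
    {"origin": 4, "destination": 5},
    {"origin": 3, "destination": 4},
]
inside, outside = split_by_common_area(trips, survey_zones, model_zones)
```

The same filter would be applied to observed and modelled trips so that both distributions relate to the same area.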

An additional point to consider is that only home-based trips can be used for validation. For example, modelled trips from home to work can be compared directly with survey data, while modelled trips from shopping to leisure are not necessarily made by residents of the study area and therefore not comparable to a survey of the study area.

Calibrating the destination choice of a travel demand model requires observed distance or time distributions from a household travel survey as a reference distribution. Since interviewed persons tend to round estimates of distance and time for their reported trips, reported distance and time values should not be used directly from the survey. Instead, the distance and time values should be computed with the model using the geocoded origins and destinations from the survey (FGSV 2012).

As Sammer et al. (2018) point out, a household travel survey always contains a systematic error, which results from underreporting of travel behaviour. Hence, even a complete fit of observed and modelled distribution does not mean that realistic behaviour is modelled. Sammer et al. (2018) give advice on how to increase the quality of a household travel survey.

The purpose of a model is to analyse impacts of scenarios compared to the impacts of a base case. For this, planners look at key performance indicators. Key performance indicators can be single values (e.g. volumes at certain locations, total distance travelled and total time spent) and distributions of travel time and distance. For comparing distributions of a base case and scenarios, distributions derived from the base case of the travel demand model replace observed distributions from the household travel survey as reference distribution.

Selection of a classification method

Class width versus class size

Usually distribution classes are determined in such a way that the class width and a number of classes are predefined. Based on the classification indicator the demand is then assigned to the resulting classes. If the class width is the same across all distribution classes (and the number of classes is infinite in an extreme case), the term “equidistant” distribution is used. However, in many cases insufficiently populated classes are aggregated, resulting in classes of varying width.

A different approach defines the share of demand in each class and uses it to determine the class width. The resulting classes vary in width but they are equally populated. This method of classifying distributions is called “equiquantile” (based on e.g. Paluš (1995)).

In the case of an equidistant classification, three questions need to be addressed:

  • What is the size of the classes?

  • How many classes should there be in total?

  • Should classes be grouped together and, if so, how?

On the other hand, the method of equiquantile distribution only raises the question about the number of classes. For this reason, the equiquantile classification method is presented first. Based on this, a procedure is explained to answer the question about the class width of equidistant distributions.

Analysis with equiquantile classes

The class boundaries of this classification method are calculated using weighted quantiles, with the elements of the indicator matrix representing the classification variable and the demand matrix elements representing the weight. The calculation used for the classification and an example calculation are shown in the appendix.

Although all classes should have the same quantity, the actual demand per class may differ slightly from the desired quantity. This may be due to the following reasons:

  • A discrete demand cannot always be distributed completely and evenly across all classes. This is the case, when \( \left( {demand} \right)\,\bmod \,\left( {number\;of\;classes} \right) \ne 0. \) As a result, at least one class has a greater demand than other classes. The smaller the population, the greater the deviation from theoretically equal classes. However, this effect is reduced as the sample size increases.

  • A single OD-pair with a large demand can also lead to a distortion, for example when \( demand_{od} > demand_{total} /number\;of\;classes. \)

  • A rounding of class boundaries prior to classification can lead to shifts in demand between neighbouring classes.
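The weighted-quantile classification and the deviations listed above can be illustrated with a minimal sketch. It does not reproduce the calculation in the appendix; it assumes one common definition, in which the k-th class boundary is the smallest indicator value at which the cumulative demand reaches k/n of the total demand. The function names and all numbers are hypothetical.

```python
from bisect import bisect_left

def equiquantile_boundaries(indicator, demand, n_classes=10):
    """Upper class boundaries so that each class holds about 1/n_classes of the demand.

    indicator, demand: equally long sequences with one entry per OD-pair.
    """
    pairs = sorted((x, w) for x, w in zip(indicator, demand) if w > 0)
    total = sum(w for _, w in pairs)
    boundaries, cumulative, k = [], 0.0, 1
    for x, w in pairs:
        cumulative += w
        # Close every class whose target share k / n_classes has been reached.
        while k <= n_classes and cumulative >= k * total / n_classes * (1 - 1e-12):
            boundaries.append(x)
            k += 1
    return boundaries

def demand_per_class(indicator, demand, boundaries):
    """Assign each OD-pair to the first class whose upper boundary is >= its indicator."""
    shares = [0.0] * len(boundaries)
    for x, w in zip(indicator, demand):
        i = min(bisect_left(boundaries, x), len(boundaries) - 1)
        shares[i] += w
    return shares

# Hypothetical direct distances (km) and demand (trips) for eight OD-pairs.
distance = [1.0, 2.0, 3.5, 5.0, 7.0, 9.0, 12.0, 20.0]
demand = [30, 20, 25, 25, 40, 10, 30, 20]
bounds = equiquantile_boundaries(distance, demand, n_classes=4)
print(bounds, demand_per_class(distance, demand, bounds))

# A single OD-pair holding more than one class share distorts the classification.
skewed_bounds = equiquantile_boundaries([1.0, 2.0, 3.0], [10, 150, 40], n_classes=4)
print(skewed_bounds, demand_per_class([1.0, 2.0, 3.0], [10, 150, 40], skewed_bounds))
```

In the second call the dominant OD-pair produces repeated boundaries, so some classes stay empty and one class is overpopulated, which corresponds to the second reason listed above.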

As mentioned above, the number of classes must be defined for the equiquantile classification. Dividing the demand into 10% steps, i.e. into ten classes, is one possible pragmatic assumption. Figure 3 shows a sample of a trip distance distribution. It is shown that each class has approximately the same number of trips and the resulting graph approximates a straight line.

Fig. 3 Example of an equiquantile trip distance distribution

Equiquantile classes can be applied for a particular demand segment, for the total demand and for the person distance travelled. Those applications are described in detail in the appendix.

Analysis with equidistant classes

As previously mentioned, the number of classes and the class width must be predefined for the equidistant classification method. A common procedure to determine those parameters is Sturges’ rule for constructing histograms, which is discussed e.g. by Hyndman (1995).
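As a brief illustration, Sturges' rule sets the number of classes to 1 + log2(n) for n observations, rounded up here to the next integer; the class width then follows from the value range. The function names are hypothetical, and the rule is shown only as a sketch of the common procedure, not as a recommendation.

```python
import math

def sturges_class_count(n_observations):
    """Number of classes k = 1 + log2(n), rounded up to an integer."""
    return math.ceil(1 + math.log2(n_observations))

def sturges_class_width(values):
    """Equidistant class width: value range divided by the Sturges class count."""
    return (max(values) - min(values)) / sturges_class_count(len(values))

# 1000 observed trips lead to 11 classes under this rule.
print(sturges_class_count(1000))
```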

Another way of determining the class width is to define the smallest class of the equiquantile distributed total demand as the equidistant class width. This class width represents the smallest class with 10% of the total demand. In a strictly equidistant distribution, the number of classes is not important, i.e. classes of equal width are created until the largest classification indicator is reached. However, in this case empty classes may occur.
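This derivation of the class width can be sketched as follows. The sketch assumes that the equiquantile class boundaries are already available as a list of upper bounds and that the first class starts at 0; the function name `equidistant_upper_bounds` and the numbers are hypothetical.

```python
import math

def equidistant_upper_bounds(equiquantile_bounds, max_indicator, lower=0.0):
    """Equidistant classes whose width equals the narrowest equiquantile class."""
    edges = [lower] + list(equiquantile_bounds)
    # Width of the narrowest non-empty equiquantile class.
    width = min(b - a for a, b in zip(edges, edges[1:]) if b > a)
    # Create classes of equal width until the largest indicator value is covered.
    n_classes = math.ceil((max_indicator - lower) / width)
    return width, [lower + (i + 1) * width for i in range(n_classes)]

# Hypothetical equiquantile boundaries in km.
width, upper_bounds = equidistant_upper_bounds([2.0, 5.0, 9.0, 20.0], max_indicator=20.0)
print(width, upper_bounds)
```

Classes towards the upper end of such a classification may remain empty or sparsely populated, as noted above.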

In a variation of the equidistant classification, the statistically unreliable, i.e. low-occupied, classes are aggregated. However, in order to use this particular classification method the input parameters number of classes, size of the grouped classes and minimum occupancy of a class need to be defined in advance. Qualivermo (Sammer et al. 2012) suggests class widths for these cases:

  • for metropolitan traffic in short distance areas: 2 km or 5 min,

  • for regional and long-distance traffic: 5 km or 10 min.

Application of distributions for validation and presentation purposes

The equidistant classification method requires three parameters as input (number, size, aggregation rules of classes). This leads to several reasonable combinations of parameters. In contrast, the equiquantile method requires only one parameter (number of classes). As the equiquantile method is the more generic representation of a demand distribution, it should be used for model validation purposes.

Results from an equidistant classification display typical patterns of travel demand, e.g. decreasing demand with increasing distance. This may be helpful for interpreting results. Thus, an equidistant classification should be used for presentation purposes, but not for validation purposes.

Selection of a quality indicator for distribution comparisons

To check the congruence of two distributions they can be plotted in a common diagram. However, a mere visual examination of congruence is often not adequate. In the following, a selection of different indicators to quantify the similarity or conformity of two distributions is presented.

Comparison of distribution parameters

In general, distributions, i.e. their position, appearance and properties, can be described by distribution parameters. By a comparison of those distribution parameters, it is possible to determine the similarity between two distributions. Therefore, it is essential to record them for each distribution. These parameters are in particular:

  • sample size N (this is equivalent to the total number of trips),

  • weighted mean \( \bar{m} \) (this represents the mean indicator, e.g. mean trip distance),

  • standard deviation \( s_m \),

  • coefficient of variation \( V_m \),

  • skew of the distribution \( \gamma_m \) and

  • percentiles of the distribution, e.g. \( Q_{0.05} ,\;Q_{0.15} ,\;Q_{0.25} ,\;Q_{0.5} ,\;Q_{0.75} ,\;Q_{0.85} ,\;Q_{0.95} . \)

The calculation specifications of the parameters are shown in the appendix.
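For orientation, the parameters listed above can be computed from an indicator matrix and a demand matrix as sketched below. The sketch assumes demand-weighted population formulas (division by N, not N − 1) and a simple cumulative definition of the percentiles; the specifications in the appendix may differ in detail. Function names and numbers are hypothetical.

```python
import math

def weighted_parameters(indicator, demand):
    """Demand-weighted distribution parameters (population formulas, assumes std > 0)."""
    n = sum(demand)
    mean = sum(w * x for x, w in zip(indicator, demand)) / n
    variance = sum(w * (x - mean) ** 2 for x, w in zip(indicator, demand)) / n
    std = math.sqrt(variance)
    skew = sum(w * (x - mean) ** 3 for x, w in zip(indicator, demand)) / n / std ** 3
    return {"N": n, "mean": mean, "std": std, "cv": std / mean, "skew": skew}

def weighted_percentile(indicator, demand, share):
    """Smallest indicator value at which the cumulative demand reaches the given share."""
    pairs = sorted(zip(indicator, demand))
    target = share * sum(demand)
    cumulative = 0.0
    for x, w in pairs:
        cumulative += w
        if cumulative >= target:
            return x
    return pairs[-1][0]

# Hypothetical trip distances (km) and demand (trips) for three OD-pairs.
distance = [2.0, 4.0, 10.0]
demand = [50, 30, 20]
params = weighted_parameters(distance, demand)
median = weighted_percentile(distance, demand, 0.5)
q85 = weighted_percentile(distance, demand, 0.85)
```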

Statistical tests

Various statistical tests can be used to check for differences between distributions and their parameters, respectively. For example, the t-test checks for significant deviations of the mean and the F-test checks for significant variance deviations (Backhaus et al. 2011).

In addition to these parametric tests, which check the distribution parameters, there are non-parametric or distribution-free tests, which check whether two samples derive from a common population. An example of this is the Kolmogorov-Smirnov test, which is suitable for interval-scaled data (Herz et al. 1992).

The Kolmogorov-Smirnov test first determines the largest absolute deviation between two discrete cumulative relative frequency distributions. This value is then compared with a critical value that depends on the sample sizes and the accepted level of error. It should be noted that in this test only the position of relative frequencies is assessed.

In contrast to a household travel survey, a travel demand model represents a full census, which means that it covers a large sample size. Consequently, the test criterion is very strict, which often results in a negative test result (i.e. the distributions do not originate from a common population).
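The strictness of the criterion can be illustrated with a minimal sketch. It uses the large-sample approximation of the two-sample critical value, c(α)·sqrt((n + m)/(n·m)) with c(α) = sqrt(−0.5·ln(α/2)), and applies it to classified data for simplicity. The class shares and sample sizes are hypothetical.

```python
import math
from itertools import accumulate

def ks_statistic(p, q):
    """Largest absolute difference between two cumulative relative frequency distributions."""
    return max(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))

def ks_critical_value(n, m, alpha=0.05):
    """Large-sample two-sample critical value for sample sizes n and m."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))

# Hypothetical relative frequencies in five classes.
observed = [0.30, 0.25, 0.20, 0.15, 0.10]
modelled = [0.27, 0.25, 0.21, 0.16, 0.11]
d = ks_statistic(observed, modelled)

# The same deviation passes for two small samples but fails for large ones.
small = ks_critical_value(500, 500)
large = ks_critical_value(5000, 2_000_000)
print(round(d, 4), round(small, 4), round(large, 4))
```

In this sketch, a maximum deviation of three percentage points is accepted for two samples of 500 trips each, but rejected for a survey of 5000 trips compared with two million modelled trips.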

Quality indicators

Quality indicators consider the similarity of two distributions x and y. This section will give an executive summary about available quality indicators. The corresponding calculation specifications for each indicator are shown in the appendix.

Correlation Coefficient and Coefficient of Determination

The Correlation Coefficient R and the Coefficient of Determination \( R^{2} \) are quality indicators that check the dependency between two (typically unclassified) data sets. R is a non-dimensional variable ranging between −1.0 and 1.0 that reflects the extent of the linear dependency between two data sets. \( R^{2} \) expresses the amount of variation explained by the independent variables of the model. The higher the Coefficient of Determination, the higher the proportion of variation explained by the regression function.

A critical point is that the Coefficient of Determination does not indicate any causality of the observed correlation and, in addition, inevitably increases with an increasing number of examined factors (Cambridge Systematics Inc. 2014; Backhaus et al. 2011). Therefore, \( R^{2} \) is suitable as a quality indicator only to a limited extent. The thresholds for acceptable \( R^{2} \) values mentioned in the literature vary between 0.88, 0.95 (Cambridge Systematics Inc. 2014) and 0.98 (Department for Transport 2014).

Mean Absolute Error, Euclidean Distance and Root Mean Squared Error

A common quality indicator is the Mean Absolute Error (MAE). As the name implies, it describes the average of the absolute deviations between the modelled and observed values. One way to scale the MAE to a unit-less quality indicator is to divide it by the mean of the observed values, which is equivalent to dividing the sum of the absolute deviations by the sum of the observed values. The resulting quality indicator is called Relative Mean Absolute Error (%MAE). The %MAE should not be confused with the Mean Absolute Percentage Error (MAPE) (Vandeput 2019).

The Euclidean Distance d considers the sum of the squared deviations across all distribution classes. Since, as will be explained later, the use of relative frequencies is advisable, a standardization to a common unit, as recommended e.g. by Backhaus et al. (2011), is not necessary.

Another common quality indicator is the Root Mean Squared Error (RMSE) and its relative and unit-less form (%RMSE). Compared to the MAE, the RMSE weights large errors more strongly, so large deviations have a stronger influence on the evaluation result (Cambridge Systematics Inc. 2014; Vandeput 2019).
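The error measures of this subsection can be sketched in a few lines. The formulas are assumptions, since the calculation specifications are given in the appendix: the relative forms are scaled by the mean of the observed values, and d is taken as the square root of the sum of squared deviations. The function name and the class shares are hypothetical.

```python
import math

def error_measures(observed, modelled):
    """MAE, RMSE, their mean-scaled relative forms and the Euclidean distance."""
    n = len(observed)
    errors = [m - o for o, m in zip(observed, modelled)]
    mean_obs = sum(observed) / n
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return {
        "MAE": mae,
        "%MAE": mae / mean_obs,      # assumption: scaled by the observed mean
        "RMSE": rmse,
        "%RMSE": rmse / mean_obs,    # assumption: scaled by the observed mean
        "d": math.sqrt(sum(e * e for e in errors)),
    }

# Hypothetical relative frequencies in ten equiquantile classes.
observed = [0.10] * 10
modelled = [0.14, 0.12, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.08, 0.06]
measures = error_measures(observed, modelled)
print({name: round(value, 4) for name, value in measures.items()})
```

With relative frequencies the unscaled MAE and RMSE are necessarily small numbers, whereas the mean-scaled forms express the deviation relative to the average class share.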

Theil’s Forecast Accuracy Coefficient

Theil’s Forecast Accuracy Coefficient examines the conformity of two distributions. Unfortunately, two indicators with the same name U have been developed. In order to distinguish them, they are called \( U_{1} \) and \( U_{2} \).

  • \( U_{1} \) ranges between 0 and 1, where 0 indicates a perfect match and 1 indicates the worst match. The worst match occurs either with a negative proportionality between the considered distributions or when both distributions are continuously equal to zero. Since all distributions are evaluated better than a naïve forecast, it is not possible to interpret them unambiguously and therefore \( U_{1} \) should not be used (Bliemel 1973; Andres and Spiwoks 2000).

  • \( U_{2} \) ranges from 0 to ∞, where 0 indicates a perfect match. At \( U_{2} = 1 \) the comparative distribution has the same quality as a naïve forecast. Consequently, \( U_{2} > 1 \) indicates that the comparative distribution performs worse than a naïve forecast. \( U_{2} \) is always preferable to \( U_{1} \) (Bliemel 1973; Andres and Spiwoks 2000).

Furthermore, the deviations between two distributions can be analysed more precisely with three components of error \( U^{M} \), \( U^{S} \) and \( U^{C} \) (FGSV 2006; Bliemel 1973; Andres and Spiwoks 2000).

  • \( U^{M} \) indicates the proportion of the mean squared error resulting from an inequality of the mean values, which in turn results from a systematic over- or underestimation.

  • \( U^{S} \) indicates systematic differences in the variances.

  • \( U^{C} \) indicates the absence of a linear correlation between the two distributions. It is an indicator for unsystematic, random errors.

According to Andres and Spiwoks (2000) and FGSV (2006), a good conformity is assumed if \( U^{M} \) and \( U^{S} \) are close to 0 (\( U^{M} ,U^{S} \to 0 \) or at least \( U^{M} ,U^{S} < 0.2 \)) and when the total error is mostly due to unsystematic errors \( \left( {U^{C} \to 1} \right). \)
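The two coefficients and the three error components can be sketched as follows. The formulas are the commonly used textbook forms and are an assumption here, since the calculation specifications are given in the appendix: U1 relates the root mean squared error to the sum of the root mean squares of both series, U2 relates it to the root mean square of the observed series, and the mean squared error is decomposed into a mean, a variance and a covariance part. The function name and the class shares are hypothetical.

```python
import math

def theil(observed, modelled):
    """Theil's U1, U2 and error components UM, US, UC (requires at least one deviation)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_m = sum(modelled) / n
    mse = sum((m - o) ** 2 for o, m in zip(observed, modelled)) / n
    s_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed) / n)
    s_m = math.sqrt(sum((m - mean_m) ** 2 for m in modelled) / n)
    cov = sum((o - mean_o) * (m - mean_m) for o, m in zip(observed, modelled)) / n
    rms_o = math.sqrt(sum(o * o for o in observed) / n)
    rms_m = math.sqrt(sum(m * m for m in modelled) / n)
    return {
        "U1": math.sqrt(mse) / (rms_m + rms_o),
        "U2": math.sqrt(mse) / rms_o,
        "UM": (mean_m - mean_o) ** 2 / mse,   # share due to unequal means
        "US": (s_m - s_o) ** 2 / mse,         # share due to unequal standard deviations
        "UC": 2 * (s_m * s_o - cov) / mse,    # share due to imperfect covariation
    }

# Hypothetical relative frequencies in five classes.
observed = [0.30, 0.25, 0.20, 0.15, 0.10]
modelled = [0.27, 0.25, 0.21, 0.16, 0.11]
result = theil(observed, modelled)
print({name: round(value, 4) for name, value in result.items()})
```

The three components add up to 1 by construction. When both inputs are relative frequencies over the same classes, their means coincide, so the mean component is zero in this setting and the remaining error is split between the other two components.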

Vortisch’s Indicator of Similarity

A distance indicator presented by Vortisch (2006) takes into account the similarity in form and position of two distributions (see Fig. 4).

Fig. 4 Similarity in form and position of two distributions (modified figure according to Vortisch (2006))

The Correlation Coefficient R describes the similarity of the shapes. Furthermore, the positional similarity θ and the overlapping of the domains σ are determined. The individual components are then combined into a general distance indicator ∆. If the distributions match perfectly, ∆ = 0 results; with increasing difference, ∆ approaches 1.

Coincidence Ratio

The Coincidence Ratio (CR) determines the degree to which two distributions overlap (see Fig. 5). The input values for the calculation are the relative frequencies of classified distributions. The CR ranges from 0 to 1, where 1 indicates a perfect match and 0 indicates no match. According to the Travel Model Validation and Reasonableness Checking Manual (Cambridge Systematics Inc. 2014), a high degree of congruence applies if CR > 0.7.

Fig. 5 Overlapping of distributions
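Written out, with \( p_i^{obs} \) and \( p_i^{mod} \) denoting the observed and modelled relative frequencies in class i (symbols introduced here for illustration only), the overlap measure reads:

\[ CR = \frac{\sum_{i} \min \left( p_i^{obs} ,\;p_i^{mod} \right)}{\sum_{i} \max \left( p_i^{obs} ,\;p_i^{mod} \right)} \]

As a hypothetical worked example with five classes, observed shares of 0.30, 0.25, 0.20, 0.15, 0.10 and modelled shares of 0.27, 0.25, 0.21, 0.16, 0.11 give a sum of minima of 0.97 and a sum of maxima of 1.03, so that

\[ CR = \frac{0.97}{1.03} \approx 0.94. \]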

Absolute or relative frequencies

As shown in Table 2, a comparison of two distributions can either use relative or absolute frequencies. The choice of relative or absolute frequencies influences the value of the quality indicators. Hence, if two distributions are to be evaluated in terms of their similarity, it must be clarified beforehand whether the quality indicator used should operate with absolute or relative frequencies. As distance and time distributions derived from surveys are usually provided only as relative values, relative frequencies should be used. Furthermore, in a travel demand model, the absolute number of generated trips should be checked directly after trip generation. If this trip generation validation is omitted, systematic differences, such as permanently too few trips per class, are not recognized.

Summary on quality indicators

Table 2 compares the presented quality indicators by the following criteria (☒ meaning “yes” and ☐ meaning “no”):

  • What is the possible range of values of the indicators?

  • Does the calculation specification require absolute or relative frequencies?

  • Is there a difference in the resulting values of the indicator when using relative and absolute frequencies?

  • Is there a possibility for in-depth analysis using the same indicator?

  • Are there applicable thresholds for the indicator?

Table 2 Comparison of quality indicators

In a benchmark comparison, over 30 equiquantile distributions with 10 classes were compared. These were derived from real travel demand models as well as systematically generated in order to examine certain characteristics of the indicators (see top Fig. 6). The box plots in bottom Fig. 6 show the result of this comparison. The following conclusions can be derived:

Fig. 6 Top figure: distribution types examined for a benchmark comparison. Bottom figure: result from the benchmark comparison of the different quality indicators

  • Correlation Coefficient R and Coefficient of Determination \( R^{2} \) fail for ideal equiquantile distributions, because every class of the reference distribution equals its mean, so that its variance is zero and the expression is undefined.

  • The results for Euclidean Distance d, Mean Absolute Error MAE and Root Mean Squared Error RMSE show a very small bandwidth in the present case of relative frequencies. Therefore, their relative forms Relative Mean Absolute Error %MAE or Relative Root Mean Squared Error %RMSE are to be preferred. A general statement as to which of these quality indicators is more suitable is not possible, since this depends, among other things, on the respective accuracy requirements of the model.

  • The expressions for Relative Root Mean Squared Error %RMSE and Theil’s Forecast Accuracy Coefficient \( U_{2} \) are mathematically identical in this case of relative deviations and classified data.

  • Due to the properties described above, Theil’s Forecast Accuracy Coefficient \( U_{2} \) and Vortisch’s Indicator of Similarity ∆ are suitable because of their sophisticated analysis options. However, the latter is less sensitive, which is reflected in the fact that it does not make full use of its value range even in the case of large deviations.

  • The Coincidence Ratio CR is sensitive to changes and can take values across its entire value range. Its value range from 0 to 1 allows a straightforward interpretation. In addition, specific thresholds from the Travel Model Validation and Reasonableness Checking Manual (Cambridge Systematics Inc. 2014) provide good orientation for the evaluation of quality.

In summary, the Coincidence Ratio CR, the Relative Root Mean Squared Error %RMSE (or Theil’s Forecast Accuracy Coefficient \( U_{2} \)) and the Relative Mean Absolute Error %MAE seem suitable for the evaluation of equiquantile classified relative frequency distributions. In addition, the three components of Theil’s Forecast Accuracy Coefficient \( U^{M} \), \( U^{S} \) and \( U^{C} \) are useful for in-depth analyses.

Conclusion and recommendations

Appropriate transport planning requires appropriate travel demand models. Currently, many discussions are taking place about what “appropriate” actually means. This paper is a contribution to those discussions. Frequency distributions are an important quality feature of travel demand models. In quality assessment it is particularly important that the classification method is known, because the current techniques that are used in everyday practice lead to individual classifications and thus to different evaluation results. This makes it impossible to compare validation results of different models and to define a quality indicator threshold in modelling guidelines.

In the previous chapters, various classification and quality determination methods have been presented. In order to ensure a uniform classification method in quality assurance, the following procedure is proposed for the standardized creation and assessment of distributions. This proposal is intended to serve as a basic concept. For specific travel demand model applications, the specifications have to be adjusted if necessary. Modelling guidelines should define a generally obligatory procedure in the future.

  1. Selection of mode-independent classification indicators for each distribution:

    • for trip distance distributions: direct distance,

    • for trip time distributions: mean weighted trip time in the reference situation.

    • The intrazonal trips are not taken into account in this evaluation. They have to be evaluated separately.

  2. Specification of the study area:

    • only areas covered by the reference and the model are considered,

    • only trips with origin and destination inside the study area are considered,

    • other trips have to be evaluated separately.

  3. Specification of the reference distribution for each application:

    • for calibrating and validating a travel demand model: a distribution from a household travel survey (keeping in mind that such surveys contain a systematic error),

    • for comparisons of modelled scenarios: a distribution from the base case of a travel demand model.

  4. Calculation of ten equiquantile classes for all relevant demand segments and for the total demand and

    • visualization of the distributions with relative frequencies,

    • presentation of the distribution parameters,

    • using the Coincidence Ratio as quality indicator (calculation with relative frequencies). According to Travel Model Validation and Reasonableness Checking Manual (Cambridge Systematics Inc. 2014), the Coincidence Ratio should be CR ≥ 0.7 for a high level of congruence.

  5. Optional: Visualization of equidistant distributions. To determine the class width, it is possible to use the width of the smallest class of the equiquantile-classified total demand. Equidistant distributions should not be used for quality tests. They are created for display purposes only.