Abstract
This paper presents an automated approach for providing ranked lists of outliers in observed demand to support analysts in network revenue management. Such network revenue management, e.g. for railway itineraries, needs accurate demand forecasts. However, demand outliers across or in parts of a network complicate accurate demand forecasting, and the network structure makes such demand outliers hard to detect. We propose a twostep approach combining clustering with functional outlier detection to identify outlying demand from network bookings observed on the leg level. The first step clusters legs to appropriately partition and pools booking patterns. The second step identifies outliers within each cluster and uses a novel aggregation method across legs to create a ranked alert list of affected instances. Our method outperforms analyses that consider leg data without regard for network implications and offers a computationally efficient alternative to storing and analysing all data on the itinerary level, especially in highlyconnected networks where most customers book multileg products. A simulation study demonstrates the robustness of the approach and quantifies the potential revenue benefits from adjusting demand forecasts for offer optimisation. Finally, we illustrate the applicability based on empirical data obtained from Deutsche Bahn.
Similar content being viewed by others
Avoid common mistakes on your manuscript.
1 Introduction and stateoftheart
Revenue management (RM) is a challenging task for network service providers. The concept entails controlling the set of product offers over a fixed sales horizon such that, given the predicted demand for the offers, the expected revenue from selling a limited capacity is maximal. To thus maximise revenue, the firm has to forecast the expected demand for all products that require capacity on mutual resources.
Examples include transport itineraries that cross several network legs, and hospitality offers that combine room availability for multiple nights.
Several existing contributions, e.g. Weatherford and Belobaba (2002) and Rennie et al. (2021), demonstrate the negative effects of inaccurate demand forecasts on revenue performance but neglect network effects. Motivated by this, we propose a new approach to detect outliers in network bookings, thereby supporting forecast corrections for improved network revenue management.
1.1 Terminology
For simplicity, we employ a transportbased terminology throughout this paper: a leg describes a direct nonstop connection between two stations in a network, and an itinerary is any combination of legs that can be jointly booked as one product. A departure describes a journey along a connected series of legs that leave the origin station at a unique time and date.
We denote the accumulation of bookings across the sales horizon as a booking pattern. We define an outlier as a booking pattern resulting from shortterm systematic demand changes for one or several related network itineraries. These outliers occur when demand deviates from the baseline due to unforeseen events. For example demand increases for a specific destination affect the entire itinerary. In consequence, deviations in the booking patterns are observable on the legs arriving at that destination and in the feeder legs.
Capacitybased revenue management differentiates offers through fare classes. Fare classes describe combinations of fares and tariffs at which the firm offers a product. Customers booking a ticket for a specific departure may choose from several offered fare classes. For instance, the cheapest offer could be fare class ‘M’, costing 20 Euros and entailing a norefund tariff.
1.2 Existing work
RM is a wellstudied problem for many different products and services (Talluri and Van Ryzin 2004). Still, only recently the specific issues around network services and demand forecasting have come into focus. E.g. Klein et al. (2020) review how singleleg practices to RM generalise to the network setting. Weatherford (2016) surveys RM forecasting methods and focuses on airline itinerarylevel forecasting.
So far, few authors have examined demand outliers in RM data. For historical hotel booking data, Weatherford and Kimes (2003) discuss a simple method of removing observations that are more than \(\pm 3\sigma\) away from the mean. Rennie et al. (2021) apply functional analysis to detect outliers on individual legs. Neither, however, consider outliers affecting multiple legs of a network. In Azadeh et al. (2013), the authors identify outliers in network railway bookings via a simple rule to remove them before forecasting future demand. For a slightly different perspective, Kumar and Khani (2020) analyse transit demand for outliers to detect special events. Notably, existing research on outlier detection frequently focuses on binary outlier detection without regard for quantifying how critical an outlier is.
Practical network RM relies on manual forecast adjustments (Quante et al. 2009; Schütze et al. 2020). The previous research has shown that the resulting judgemental forecasts can be biased and even superfluous (Lawrence et al. 2006; De Baets and Harvey 2020). Perera et al. (2019) note that forecasting support tools can improve user judgement by reducing complexity for the analyst. Analysts’ time is limited, so they cannot investigate every departure flagged as an outlier. For example Deutsche Bahn experts estimate that they can reasonably adjust less than 1% of forecasts. Therefore, ranking outliers by criticality is crucial.
Beyond RM, Barrow and Kourentzes (2018) propose a functional approach for outlier detection in call arrival forecasting without regard for network effects. General outlier detection in networks often focuses on identifying outlying parts of the network. Fawzy et al. (2013) use this approach in wireless sensor networks to find faulty nodes. Ranshous et al. (2015) consider the extension to identify outlying nodes when the network changes over time. Most research on dynamic networks concentrates on analysing a single time series connected to each node rather than a set of time series, as required when booking patterns are reported for multiple departures. Hyndman et al. (2016) note that the problem of identifying unusual time series within a collection is not as extensively studied as other outlier detection problems. In this paper, we benchmark the approach suggested by Hyndman et al. (2016), which employs principal component analysis (PCA), against our newly proposed approach.
1.3 Contribution
We shall study booking patterns that result when customers book not just a single resource (leg) but network products that require multiple resources (itineraries). Such booking patterns may be reported on the leg or the itinerary level; in this paper, we assume that they are reported per leg and departure. This applies in the case of Deutsche Bahn, which serves as a motivation and empirical demonstration for the work presented here.
Network effects challenge outlier detection in two ways: On the one hand, demand outliers on the itinerary level affect bookings on all legs included in the itinerary. On the other hand, such outliers may not be recognisable when only considering leg bookings independently, given the noise from other itineraries overlapping those legs. As a result, directly extracting outliers from booking data collected in an entire, realistically sized network is likely an intractable problem. To circumvent this problem, in this paper, we aggregate and analyse booking patterns from legs instead of itineraries, as this allows for computationally and statistically tractable networkwide outlier detection. In Sect. 6, we further discuss the choice of leglevel vs. itinerarylevelbased analysis and point out how our procedures could be adapted to itinerarylevel outlier detection.
Our network outlier detection procedure: (i) clusters legs with similar booking patterns and (ii) detects joint outliers within each cluster to compile ranked alert lists of outlying departures and affected legs. Our methodology significantly improves outlier detection performance in a network setting versus alternative methods.
In more detail, our proposed approach first clusters legs by measuring the similarity of booking patterns via functional dynamical correlation (Dubin and Müller 2005). We suggest this measure for its freedom from restrictive assumptions. As the proposed approach is modular, other correlation measures could be used for the same end. In the second step, the proposed approach detects outliers from booking patterns within each cluster by combining the functional data analysis methods of Febrero et al. (2008); Hubert et al. (2012); Rennie et al. (2021) with a novel withincluster aggregation, which generates a ranked alert list of outliers using extreme value theory. This alert list can help analysts to identify the need for further analysis and adjustments. We consider an outlier as more critical if it indicates a larger demand shift and if it is identified across multiple legs. Factors such as the average fare on legs where outliers are detected, or the revenue at risk from faulty forecasts could also be incorporated into the definition of an outlier’s criticality.
Finally, analysts have several choices when tasked with forecast adjustment for network services. The best choice is not obvious, and we further quantify the impact of different potential adjustments on revenue in a simulation study, following concepts outlined in Kimms and MüllerBungart (2007).
In summary, this paper contributes (i) a method for identifying network legs that will benefit from joint outlier detection and (ii) a method to aggregate outlier detection across any number of legs to create a ranked alert list. To thoroughly evaluate the proposed approach, we offer (iii) wideranging simulation studies to benchmark the method’s outlier detection performance against and to quantify the potential revenue improvements from forecast adjustments and (iv) a demonstration of applicability on empirical railway booking data from Deutsche Bahn.
2 Method
Several network products may rely on common resources when demand concerns multiple legs at once. In the transport example, even passengers that booked different itineraries often have to traverse the same legs. Therefore, specific legs share common outliers, as, for example a sudden increase in demand from passengers travelling from one end of the network to attend an event at the other end would increase demand for each of the inbetween legs. Neither considering each leg independently nor jointly considering the whole network will create the best results when the network spans multiple regions that differ strongly in demand—see Sect. 3.4 and Appendix 6. This raises the question of which legs to consider jointly for outlier detection.
To find an answer, in Sect. 2.1, we adapt a method by Zahn (1971) to cluster legs such that (i) legs in the same cluster share demand and can be considered jointly for outlier detection, and (ii) legs in different clusters experience distinct demand and should be considered separately. Subsequently, in Sect. 2.2, we suggest a method for analysing bookings within one such cluster. Based on this, we propose a method to rank departures by the severity of identified outliers.
2.1 Clustering legs using correlationbased minimum spanning trees
To cluster legs based on correlations in observed bookings, we first consider the network as a graph where nodes represent the stations and edges represent the legs of a journey. Figure 1a illustrates this on a simple network. To illustrate the concept, we rely on an example from the transport domain: In this example, two train lines (red and blue) intersect at two stations (B and C). The red train arrives at stations B and C before the blue train, which creates two possible transfer connections for passengers: (i) switch from red to blue at B and (ii) switch from red to blue at C. Transfers from the blue to red train are not feasible.
Standard graph clustering algorithms, as exemplified in Schaeffer (2007), seek to cluster the nodes of the graph. In contrast, we wish to cluster similar edges, which correspond to legs in the railway example (Fig. 1a). Hence, we invert the graph to make existing clustering algorithms applicable. In this inversion (Fig. 1b), the directed edges become nodes, e.g. the edge from A to B becomes node AB. The inverted graph features an undirected edge between two nodes when:

both legs are in the same train line and share a common station, e.g. legs CD and DE are connected through station D, or

the legs are in different train lines but share a common transfer station where a connection is possible, e.g. leg FB (red line) and BC (blue line) are connected through station B. However, AB (blue line) and BC (red line) would not be connected by an edge as no connection can be made between them (as we have assumed the red train arrives at B and C before the blue train).
In theory, this transformation could also create edges between legs that share a common entry or exit node, e.g. FB (red line) and AB (blue line), or CG (red line) and CD (blue line). Given that such pairs of legs would never occur in the same itinerary, we would not expect demand outliers to affect both legs. Therefore, inserting an edge between them, potentially allowing them to be in the same cluster, is counterintuitive. In addition, exploratory analyses of the empirical data found that correlations between these types of legs were very low across the entire network section.
The algorithm aims to assign those legs that experience similar bookings to the same cluster and those that experience dissimilar bookings to distinct clusters. A corresponding metric only needs to consider the similarity between adjacent legs that share a connecting station since edges do not otherwise exist in the inverted graph. We propose to quantify this similarity via the correlation between booking patterns.
To calculate correlations between booking patterns, we compute the functional dynamical correlation (Dubin and Müller 2005). Functional dynamical correlation is based on calculating scalar products between pairs of smoothed booking patterns; the appendix provides further details. We use the average of these paired correlations over time as the similarity measure between any two legs. Unlike more common statistical correlation measures, such as Pearson correlation, functional dynamical correlation does not assume a specific type of relationship between variables (e.g. linearity). It also accounts for the time dependency between observations within the booking horizon when the intervals between observations vary. For example in the empirical RM data analysed in Sect. 5, the time between observations decreases as the departure date approaches. Further, alternative measures for calculating correlations from functional data (such as functional canonical correlation) often make restrictive assumptions, which real data do not fulfil (He et al. 2003).
We benchmark the clustering algorithm under alternative correlation measures in Appendix 8.
To represent the relationship between legs in the network, i.e. the nodes in the inverted graph, we attach weights to the edges in the inverted graph. These weights are interpreted as distances: A higher edge weight indicates that the connected nodes are more dissimilar. Therefore, an applicable weight function should be nonnegative. Further, the weight function needs to ensure that any negatively correlated legs are marked as more dissimilar. Even though a negative correlation may imply that outlier demand jointly affects both legs, we expect it to affect negatively correlated legs differently. Therefore, these require different adjustments from an analyst and should be in different clusters. To satisfy these requirements, we define the edge weights as:
where \(\rho (ij, jk)\) is the correlation between bookings on legs ij and jk. Though the use of functional dynamical correlation as a measure of similarity between time series is not new, its application as an edge weight in a network setting, to our knowledge, is novel.
To allow for irregular cluster shapes, we recommend a minimum spanning tree (MST) algorithm (Prim 1957). For example in Fig. 1b, a cluster may include AB and DE because they are in the same line, rather than clustering AB and FB. Minimum spanning tree approaches work well for clusters with irregular boundaries (Zahn 1971). Alternative clustering approaches (such as kmeans) often assume a specific shape of clusters (spherical, for kmeans). MSTbased clustering approaches also do not assume that clusters are of similar sizes (Peter and Victor 2010). This makes them particularly suitable for transportation networks constructed as a series of interlocking lines, where the points of intersection are often not equally spaced. For example MSTbased approaches have previously been used in optimising layouts of railway networks (Liang et al. 2020).
A spanning tree of a graph is a subgraph that includes all vertices in the original graph and a minimum number of edges, such that the spanning tree is connected. Then, the MST is the spanning tree with the minimum summed edge weights—see Fig. 1c. Since the inverted graph is weighted, we use Prim’s algorithm (Prim 1957) to calculate the MST—Appendix 2 provides a detailed introduction. Any onetoone transformation of the weight function, \(w_{(ij, jk)}\), will produce an identical minimum spanning tree.
There are two approaches to obtaining clusters from an MST: (i) predefining the number of clusters as k and removing the \(k1\) edges with the highest weight; or (ii) setting a threshold for the edge weights and removing all edges with weights above some threshold, creating an emergent number of clusters. Here, we implement the thresholdbased approach, ensuring that each cluster has the same minimum level of correlation. In contrast, setting the number of clusters in advance could result in very heterogeneous levels of correlation across clusters. Further, setting k too low may result in legs with dissimilar features being grouped together. We apply a threshold correlation of 0.5—the level at which legs are more correlated than they are not. This corresponds to a transformed edge weight of 0.5. In the example given in Fig. 1c, this means removing all legs with a weight above 0.5, resulting in the three clusters shown in Fig. 1d. The choice of this clustering threshold will impact the number of alert lists produced. Therefore, we recommend considering factors such as staffing resources and any current (informal) network clustering when choosing this threshold.
While the outlier detection procedure described next applies to individual clusters, it does not require a particular clustering approach. Hence, other implementations may employ alternative approaches, as reviewed in Schaeffer (2007). In particular, depending on the business context of the network service, alternative clustering algorithms may be more appropriate. The network topology should drive the choice of which clustering algorithm is most appropriate: That topology may differ, e.g. when considering airlines versus bike rentals versus railways, as discussed in Rennie et al. (2022). The choice of an MSTbased approach, which often returns linear clusters, is appropriate for the railway application motivating the paper at hand, given the linear nature of the underlying network structure. We further evaluate the performance of MST clustering in this application in Appendix 8.
Furthermore, edgebased clustering could replace the graph inversion and nodebased clustering presented here. However, the literature on edgebased clustering is far more limited, and such approaches tend to improve the visualisation of networks with a very high number of edges by reducing the number of edge crossings rather than grouping together the most similar edges (Qu et al. 2007). In contrast, inversion and nodebased clustering aim to group network legs that exhibit the highest degree of similarity. However, alongside these advantages, there may be some drawbacks. The nodebased approach requires deciding on criteria to select edges to include in the inverted graph.
2.2 Detecting outliers in clusters of legs
Given established clusters, we propose identifying demand outliers within each cluster and quantifying their severity to provide a ranked alert list of departures. The previously described clustering allows for processing the outlier detection in parallel for separate clusters, enabling efficient computing.
To identify which departures to include in the alert list, we consider the functional depth of the booking patterns, as in Rennie et al. (2021). This step could also rely on other measures of exceedance, including univariate 'threshold' approaches, which look at aggregated bookings and ignore the distribution of bookings over time. We propose to rely on functional depth, as the previous work has found this to be the most effective as an outlier detection mechanism (Rennie et al. 2021).
To compute the functional depth, consider N departures observed over L legs. Let \(\varvec{y}_{nl} =\left( y_{nl}(t_1), \hdots , y_{nl}(t_{T}) \right)\) be the booking pattern for the \(n^{th}\) departure on leg l, observed over T booking intervals \(t_1,\ldots ,t_T\). Let \({\mathcal {Y}}_l\) be the set of N booking patterns for leg l. For each leg and departure, calculate the functional depth (\(d_{nl}\)) given the related booking patterns following the approach given in Hubert et al. (2012) and detailed in Appendix 3. The functional depths take on positive values, with smaller values of the depths relating to more outlying booking patterns.
For each leg l, we calculate a threshold for the functional depth using the approach of Febrero et al. (2008). This method (i) resamples the booking patterns with probability proportional to their functional depths (such that any outlying patterns are less likely to be resampled), (ii) smooths the resampled patterns, and (iii) sets the threshold \(C_l\) as the median of the \(1^{st}\) percentiles of the functional depths of the resampled patterns. Here, we use the \(1^{st}\) percentile of the depths as the default threshold, as this has been found to work well in practice (Febrero et al. 2008; Rennie et al. 2021). Booking patterns with a functional depth below the threshold \(C_l\) are classed as outliers. We explore alternative threshold choices in Appendix 2.
To create ranked alert lists, we first define \(z_{nl}\) to be the normalised difference between the functional depth and the threshold:
This transforms the depth measure \(\textrm{d}_{nl}\) into a measure of threshold exceedance. Values of \(z_{nl}\) greater than zero relate to booking patterns classified as outliers. Normalising by the threshold, \(C_l\), ensures that the values of \(z_{nl}\) are comparable between different legs.
Next, we define the sums of threshold exceedances across legs:
We sum only those values of \(z_{nl}\) that are greater than zero to avoid outliers being masked when they occur only in a subset of legs. This sum implicitly accounts for both the size of an outlier—larger outliers further exceeding the threshold, resulting in larger values of \(z_{nl}\)—and for the number of legs where a departure is classified as an outlier (by summing a larger number of nonzero values). To provide an example, Fig. 2 shows those values of \(z_n\) that exceed zero for a fourleg section of the Deutsche Bahn network as discussed further in Sect. 5.2. These values of \(z_n\) correspond to departures where the booking pattern for at least one leg is identified as an outlier. In contrast, all other departures have no detected outliers in any leg such that \(z_n=0\).
To create a ranked list of outlier departures, i.e. those with a nonzerosum of threshold exceedances, we assign a severity \(\theta _n\). A higher value of \(\theta _n\) indicates that the departure is more likely to be affected by extreme outlier demand and hence should be targeted first by RM analysts.
To model threshold exceedances, we turn to extreme value theory (EVT)—a branch of statistics that deals with modelling rare events occurring in the tails of a distribution. Given that outliers are unusual events, which occur in the tails of distributions, EVT is a clear direction to turn to for modelling outliers—see Talagala et al. (2019). There are two common approaches to EVT: (i) block maxima, which examine the maximum value in evenlyspaced blocks of time, e.g. annual maxima, and (ii) peaks over the threshold, which examines all observations that exceed some threshold (Leadbetter 1991). The generalised Pareto distribution (GPD) is commonly used to model the tails of distributions in the peaks over threshold approach (Pickands 1975). Motivated thus, we fit a generalised Pareto distribution (GPD) to the sum of threshold exceedances given in equation (3). The GPD has three parameters with probability density function:
for
Here, \(\mu\) specifies the location, \(\sigma\) the scale, and \(\xi\) the shape of the distribution. We fit the parameters using maximum likelihood estimation (Grimshaw 1993) via the R package POT (Ribatet and Dutang 2019). A kernel density estimate of the empirical distribution of \(z_n > 0\) from Fig. 2 is shown in Fig. 3a. The resulting fitted GPD is shown in Fig. 3b. As the further analysis in Appendix 17 shows, the GPD fit appears reasonable compared to the empirical distribution.
Two common issues arise in fitting GPDs: (i) the choice of threshold and (ii) the independence of the data points. When the threshold is too low, the assumption of a GPD no longer holds; when it is too high, there are too few data points to fit. We select a threshold of 0, i.e. we fit the GPD to values of \(z_n > 0\). Rather than change the threshold at the GPD level, we control the number of observations the GPD is fitted to by varying the percentile used for the individual leg thresholds, \(C_l\). We choose \(C_l\) as suggested by Febrero et al. (2008) and find that this choice works well and provides sufficient outlying points to fit a GPD in both simulated and empirical data.
To account for the second issue, applications of extreme value theory frequently first decluster the peaks over the threshold to ensure independence between observations (Fawcett and Walshaw 2007). To that end, the analysis may only consider the maximum of two peaks within a small time window. For transport departures, it is theoretically possible that observed outliers may be dependent; e.g. increased demand caused by Easter affects not only Easter Sunday but also the surrounding days. However, similar outliers may also result from independent events. As we aim to identify outlying departures rather than the underlying events, this argument causes us not to decluster here.
We define \(\theta _n\) as the nonexceedance probability given by the CDF of the GPD:
Formally, \(\theta _n\) is the probability that, given an outlier occurs, the sum of threshold exceedances is at least as large at \(z_n\). Thus, it is not the probability that a departure is an outlier. However, we use this nonexceedance probability as a measure of outlier severity on a scale of 0–1.
Departures with functional depths that do not fall below the threshold on any legs carry a severity of zero, i.e. they are classified as regular departures. It is conceivable to estimate the uncertainty of \(\theta _n\) (Smith 1985) to determine further levels of criticality, e.g. if there are several departures with the same outlier severity, the one with the smallest uncertainty would be ranked first. However, given the continuous nature of the data, it is unlikely that multiple departures carry an identical severity. Hence, we leave uncertainty estimation to future research.
From the severity defined in equation (6), we construct a ranked alert list containing all departures with a nonzero outlier severity. Although functional depth could be directly used to construct the ranked alert list, computing the severity provides a measure of the difference between ranks and is more easily interpreted by analysts. The top 8 ranked outliers relating to Fig. 2, are shown in Table 1.
In practice, RM analysts’ time and resources allow them to examine and adjust controls or forecasts only for a limited number of suspicious booking patterns. Those departures that (i) exceed the functional depth threshold in only one leg or (ii) exceed the threshold only to a small degree have lower but strictly nonzero severity. These outliers are most likely false positives and potentially waste analysts’ time. Hence, we suggest limiting the length of the list in practice.
To limit the length of the alert list, we might (i) only include departures if their severity is above some threshold or (ii) set a maximum length. Since we wish to control the number of alerts an analyst will receive, we analyse outlier detection performance as dependent on the maximum length of the alert list. Recall that we classify departures as outliers if and only if their outlier severity exceeds zero. Therefore, if the required length of the alert list exceeds the number of identified outliers, we do not include further departures. Appendix 7 features further result on the outlier detection performance when varying the outlier severity threshold.
3 Outlier detection performance
We first implement a simulation study to evaluate the outlier detection performance given known outliers. By varying the demand for itineraries in one cluster, we create outliers that are observable on both the leg and network levels.
The simulation models a network consisting of five stations and four legs, as shown in Fig. 4, mirroring the structure of an empirical railway network cutout. The network includes 10 possible itineraries represented by \({\mathcal{O}} = \{\mathrm{AB, AC, AD, AE, BC, BD, BE, CD, CE, DE}\}\). On each itinerary, the firm offers seven fare classes. In this model, a fare class describes a particular price or fare associated when booking a ticket to travel the itinerary in that class. There are no additional restrictions differentiating classes.
3.1 Demand settings
Extending the demand model described in Rennie et al. (2021) to the network setting, the simulation generates booking requests per customer type i according to a nonhomogeneous Poisson process, where the arrival rate per itinerary o, \(\lambda _{i,o}(t)\), at time \(t\), is given by:
here \(\phi _{io}\) is the fraction of customers of type i and \(D_{o} \sim \text{ Gamma }(\alpha _{o},\beta _{o})\) with probability density function:
where \(a_{io}\) and \(b_{io}\) define are the parameters of a Beta distribution which defines how customers arrive over time. We generate demand over a horizon of 3600 time slices to ensure \(\lambda _{i,o}(t) < 1\). This level of detail is required to accurately parameterise the dynamic program for bid price control. The resulting bookings are aggregated into 18 booking intervals.
As in Rennie et al. (2021), we consider differentiated demand from two customer types represented by the set \({\mathcal {I}} = \{1,2\}\). We assume that customers book the cheapest available fare class and differ in price sensitivity. We define \(p_{ijo}\) as the probability that a customer of type i pays up to fare class j on itinerary o. By combining demand from two customer types that differ in price sensitivity with offers that depend on the current set of offered classes, we mimic a realistic price effect: Offer prices result from the cheapest class currently offered by the firm, as customers will buy the cheapest available class. When a customer’s willingness to pay does not equal or exceed the price of the cheapest available class, they do not buy; hence, their price sensitivity translates to decreased demand. Note that the price–demand response depends on the itinerary and time in the booking horizon.
Combining this demand model with the given network creates 210 demand parameters. Table 2 provides a full list of parameter values and interpretations of each parameter. We set the parameters to mirror common RM assumptions (Weatherford and Bodily 1992): (i) valuable customers from type 1 book later than customers from type 2, (ii) customers book earlier for longer journeys, and (iii) customers are willing to pay a higher fare class if they are travelling further. Most passengers book tickets boarding at A and leaving at E; this ensures the correlation between the legs exceeds 0.5 and guarantees that the legs are correctly modelled in the same cluster as detailed in Appendix 19.
We validate that the functional dynamical correlation between the four legs for simulated data is comparable to empirical railway data as detailed in Appendix 19. We generate all regular demand based on these parameters.
The simulation excludes trend and seasonality to evaluate outlier detection approaches in a bestcase scenario. In other words, if an algorithm fails on observations from stationary demand, it will likely not perform better given more demand variability. However, additional results based on simulation data that do feature seasonality can be found in Appendix 10.
3.2 Outlier generation and evaluation
We generate demandvolume outliers by changing the gamma distribution parameters that govern the total demand level according to equations (7) and (8). The previous work found that the proportion of outliers had little effect on outlier detection performance in the singleleg case (Rennie et al. 2021). Therefore, we generate booking patterns for 500 departures per demand setting, with 1% of departures experiencing outlier demand. That is, we generate 495 departures from the regular demand distribution and five outliers from a set of twelve outlier distributions where the mean has shifted by \(\pm 10\%\), \(\pm 20\%\), \(\pm 30\%\), \(\pm 40\%\), \(\pm 50\%\), and \(\pm 60\%\). For every shift in the mean, we reduce the variance of the outlier demand distribution by \(80\%\). This still results in an overall increase in the variance of total demand in the presence of outliers but also ensures that we sample sufficiently outlying demand values. Outliers may also occur due to factors such as changes in arrival times or changes in customers’ willingness to pay. Rennie et al. (2021) provide results on how the performance of functional depth varies under these different types of outliers. Here, we focus on the different types of outliers caused by varying network effects. In all cases, we consider the application of the outlier detection procedure to the constrained demand—applying the approach directly to the booking patterns without applying any unconstraining approaches first. The problem of unconstraining is one of the major challenges of demand forecasting for revenue management and is beyond the scope of this paper.
We differentiate outlier scenarios in terms of the affected network components. Firstly, we evaluate a scenario where outlier demand affects all network itineraries. We consider the case where each outlier is randomly drawn from one of the twelve outlier distributions, resulting in outliers from a mixture of different distributions. This lets us test whether the ranking of the alert list mirrors the outliers’ underlying degree of demand deviation. Then, we consider each of the twelve outlier distributions in isolation to assess the detection sensitivity. Secondly, we evaluate a scenario where outliers only affect a single itinerary. This evaluates the benefits of clustering multiple legs.
In Appendix 6, we consider the practically relevant case of outliers affecting a subset of itineraries and provide further details on all simulation experiments.
Each combination of outcomes can be classified into one of four categories: (i) assigning a nonzero outlier severity to a genuine outlier creates a true positive (TP); (ii) assigning a zero outlier severity to a regular observation creates a true negative (TN); (iii) assigning a nonzero outlier severity to a regular observation creates a false positive (FP); and (iv) assigning a zero outlier severity to a genuine outlier creates a false negative (FN). This classification enables us to compute the truepositive rate (TPR) for the top R ranked departures in the alert list:
where \(\mathrm{TP_R}\) is the number of true positives in the top R departures. The truepositive rate lies between 0 and 1, where 1 means all genuine outliers were identified. We evaluate performance across 1000 stochastic simulations.
In an ideal setting, the alert list should feature, from top to bottom, large outliers and, subsequently, smaller outliers. Therefore, we also use the distribution of outliers within the ranked alert list to evaluate how well the method ranks the most critical outliers.
3.3 Benchmarked outlier detection approaches
For benchmarking, we term the newly proposed approach FD+Agg and compare it to two alternatives from the literature: Principal component analysis combined with highdensity regions (PCA+HDR) as inspired by Hyndman et al. (2016), and the legbased functional depth analysis as proposed in Rennie et al. (2021).
3.3.1 Comparison with PCA+HDR
This benchmark (i) computes features (e.g. mean, variance, and curvature) of the booking patterns for the total demand in a cluster; (ii) uses PCA (Yang and Shahabi 2004) to identify the first two principle components from the features; and (iii) uses HDR, a densitybased approach (Hyndman 1996), to find the \(\nu\) points with the lowest density in the first two principal components. These points are classified as outliers. Extended details of the method, including the list of features, can be found in Appendix 6. This method provides an ordering of the outliers but not a severity measure, as illustrated by Fig. 5.
3.3.2 Comparison with nonranked, singleleg approaches
To highlight to critical features of FD+Agg, we benchmark (i) the use of severity measures to rank outliers and (ii) the inclusion of network effects. To isolate the effects of each of these features, we perform two separate benchmark tests:
We evaluate the effect of ranking outliers by measuring the increase in precision when ranking outliers. For example we consider the precision in the top 5 ranked departures versus 5 randomly chosen departures with nonzero outlier probabilities (i.e. as in Rennie et al. (2021)). The change in precision when considering the top R departures, \(\Delta (Precision)_{R}\), is given by:
where \(\mathrm{TP_{R(random)}}\) is the number of true positives in a random selection of R departures with nonzero severity, and \(\mathrm{FP_{R(random)}}\) is defined analogously for false positives.
We quantify the value of accounting for network effects by computing ranked alert lists for each leg in isolation. We then compare the truepositive rates to the aggregated, networkdriven approach presented in this paper.
3.4 Detecting outliers in multiple legs
As a first experiment, we consider the scenario where outlier demand equally affects all itineraries and legs within the cluster. For this scenario, Fig. 5a illustrates how the truepositive rate (TPR) increases when ranking outliers for different lengths of the alert list. The red line indicates the number of genuine outliers. The truepositive rates for our method (denoted as FD+Agg) are promising, with a TPR of around 0.2 for a list length of 1. Since there are five genuine outliers, this indicates that a genuine outlier is almost always ranked top. Results under different functional depth thresholds are given in Appendix 2.
3.4.1 PCA+HDR benchmark results
The PCA+HDR approach requires a given number of outliers to detect, \(\nu\), as input. Therefore, we compare the performance of the benchmark method under different choices of \(\nu\) to FD+Agg.
Figure 5b shows that the truepositive rate achieved by FD+Agg consistently exceeds that achieved by PCA+HDR. To achieve the same level of the truepositive rate, PCA+HDR would need to classify around 250 departures (i.e. 50%) as outliers. In comparison, FD+Agg achieves this rate starting at about 30 classified outliers. We consider this a successful validation of the effect of ranking outliers in FD+Agg. Appendix 1 lists these results in tabular format.
Figure 5c shows the distribution of each outlier magnitude in the alert lists. Under FD+Agg, the modes of the distributions generally fall where they should, as larger outliers are ranked higher. The smaller variance in the ranking of the larger magnitude outliers indicates that they are easier to detect. The higher variance of the mediumsized outliers can be explained as the ranking of a mediumsized outlier is dependent on which other types of outliers occur: If there is a large and a medium outlier, the medium outlier is ranked lower; if there is a small and a medium outlier, the medium outlier is ranked higher. The distribution of outliers detected by PCA+HDR, shown in Fig. 5d, also has the modes in the correct order. However, there is much more overlap between the distributions, showing its inability to correctly rank the outliers.
3.4.2 Comparison with nonranked approach
Figure 6a highlights how the precision improves when ranking outliers instead of listing them in random order. Ranking particularly improves precision when the alert list covers only a small number of departures. As domain experts indicate that analysts cannot target more than 1% of departures, ranking focuses resources and thereby provides large benefits in practice. Nevertheless, Fig. 6a (when contrasted with Fig. 5a) also highlights the tradeoff between reducing the number of false alerts and identifying all outliers. A shorter length of alert list increases precision but reduces the truepositive rate.
The increase in precision from applying our method compared to PCA+HDR is similar to the increase in precision from the inclusion of the ranking (see Fig. 6b). This suggests that PCA+HDR performs reasonably well in terms of outlier detection, but poorly in terms of ranking the outliers.
3.4.3 Comparison with singleleg approach
Figure 7 shows the truepositive rate when a ranked alert list is computed for each leg in isolation versus in the proposed aggregated manner. Here, we consider outlier demand generated by a 50% increase in the affected legs as an illustrative example. We analyse detection performance by breaking down results in terms of which itinerary the outlier demand is generated in. We show only the results relating to itineraries AB, AC, AD, and AE. Figure 26 in Appendix 5 details results for the further itineraries yielding similar conclusions.
For results, when outlier demand is generated across combinations of itineraries, refer to Appendix 6.
In all cases, the truepositive rate for clusters is higher than in any of the individual legs. This is because when considering the leg’s bookings in isolation under outlier demand that affects multiple legs, the noise from other itineraries prevents detecting the outlier in every leg. However, clustering increases the number of detected genuine outliers.
Aggregation is most beneficial when the outlier demand affects the most legs. In our example, this applies when itinerary AE experiences outlier demand, as shown in Fig. 7a. The lower truepositive rates in legs AB and DE result because different combinations of itineraries also utilise these legs. The aggregation is less beneficial when outlier demand affects an itinerary consisting of only one or two legs since we aggregate the analysis across legs that are actually not affected by outlier demand. However, there is a modest gain in truepositive rate even in this case—compare Fig. 7(c). This is due to the knockon effects of decreased capacity on the affected legs, impacting the bid prices for any itineraries which include these legs. For some lengths of the alert list, the leglevel truepositive rates are higher than the aggregated approach, due to false positives from unaffected legs being included in the list. However, even for itinerary AB (Fig. 7d), where false positives from unaffected legs are most likely, the difference is small and cancelled out by the overall increase in truepositive rate.
3.4.4 Sensitivity to different magnitudes of outliers
To better understand outlier detection performance, we break down the results by the magnitude of outliers in Fig. 8.
When outliers result from minor changes in demand levels, they are difficult to detect, resulting in low truepositive rates. Given the significant overlap between the distribution of outlier demand with a 10% change in magnitude and that of regular demand, this is to be expected. Therefore, 10% demand changes effectively provide a lower bound on how big an outlier needs to be in order to be detected.
As the magnitude of the outliers increases, they become easier to detect and truepositive rates are higher, with peak rates reached with shorter alert lists. Thus, genuine outliers are more likely to be ranked higher when they are caused by larger demand changes. For demand decreases of at least 50%, the truepositive rate is very close to the optimal detection rate. Negative demand outliers are slightly easier to detect than positive demand outliers, meaning shorter alert lists are required. This is due to the demand censoring imposed by the booking controls and capacity restrictions.
Figure 9 shows the precision gap over randomly ordered lists. Once more, larger magnitude outliers result in larger precision improvements from ranking, while detecting minor outliers gains little over random selection. Similarly, we observe that detecting negative demand outliers gains slightly more precision in comparison with detecting positive outliers of the same magnitude. Additional results regarding false discovery rates are available in the appendix.
4 Simulation study: forecast adjustments
To evaluate the implications of adjusting the demand forecast for further planning steps, we simulate network demand and the optimisation of offered fare classes over the booking horizon. We list and explain all parameters determining the settings in the simulation study in Appendix 7. In this section, we first detail how the simulated RM system uses the demand forecast to compute revenueoptimal offers based on bid prices. In that, it follows a widely implemented industry standard. Subsequently, we describe alternative strategies that analysts may apply to adjust demand forecasts based on identified outliers. Finally, by comparing revenue gained from offers based on different adjusted demand forecasts under the same simulated outlier demand, we highlight the effects of adjustments as dependent on outlier scenarios.
4.1 Network revenue management system
The simulated RM system controls the offered set of fare classes per itinerary to optimise expected revenue. To that end, it implements a dynamic program to compute bid prices per leg and sums them up per itinerary following the methodology described in Strauss et al. (2018) and detailed in the appendix. To test for the sensitivity of results with regard to the revenue optimisation, we compared two industry standards, the legbased EMSR heuristic as introduced in Belobaba (1987) and dynamic programming in initial simulations studies not further documented here. The results showed that, for the given demand model, the choice of optimisation approach had little effect on the quality of the outlier detection.
The bid price indicates the marginal difference between the value of selling a seat in the current time period and that of reserving it to sell in a future time period. The RM system only offers fare classes where the revenue from a booking exceeds the bid price. Thus, as an RM term, bid prices do not denote the customer’s bid but indicate the minimum price a fare class must carry to be included in the offer set. From those classes in the offer set, customers only consider the cheapest offer. Bid prices depend on the time until departure, unsold capacity, and expected demand. Note that in the examples given here, we consider a single capacity per leg, not differentiating, for example 1st or 2nd class compartments with separate capacities.
Booking patterns result as customers arrive and decide to book one of the offered fare classes. The firm does not report booking patterns for each individual itinerary, but only records them on the leg level.
The dynamic programme relies on a given set of expected demand arrival rates per leg l, fare class j, and time slice t of the booking horizon. In the simulation, we derive expected demand arrival rates from our knowledge of the underlying demand model. Arrival rates for each leg l and fare class j are given as
where \(\lambda _{i,o}(t)\) is the arrival rate of customers of type i requesting itinerary o, and \({\mathcal {O}}_l\) is the set of itineraries which include leg l. This creates an artificially accurate demand forecast. Deriving the demand forecast from the actual demand parameter values ensures that the estimation of revenue loss caused by undetected outliers is not affected by flawed forecasts (see Sect. 4.3). In practice, demand parameter values are not known but are estimated based on previously observed demand and timeseries forecasting. A recent survey of related research contributions can be found in Banerjee et al. (2020), while Fiig et al. (2019) represent an example of the ongoing discussion on the link between forecast accuracy and RM performance.
4.2 Forecast adjustments for outlier demand
One aim of identifying outlier demand in booking patterns is to support analyst adjustments in RM systems. Without such adjustments, offers would be optimised for a regular demand forecast and thereby not be fit for maximising revenue under outlier demand. This raises the difficulty of predicting the consequences of analyst adjustments throughout the network. As a step in this direction, we analyse a bestcase scenario, assuming that the adjustment is made with foresight before the start of the booking horizon. We compare the revenue under three different adjustments:

Adjustment 1 (conservative) Adjust only forecasts of affected singleleg itineraries. E.g. for an outlier creating additional demand for itinerary AC, increase the forecasts of itineraries AB and BC.

Adjustment 2 (aggressive) Adjust forecasts of all itineraries that include at least one of the affected legs. E.g. for additional demand for itinerary AC, adjust all itineraries, including either leg AB or leg BC—i.e. itineraries AB, AC, AD, AE, BC, BD, and BE.

Adjustment 3 (balanced) Adjust forecasts of affected singleleg itineraries and the clusterspanning itinerary; in this case, AE. E.g. for additional demand for itinerary AC, adjust itineraries AB, BC, and AE. The motivation for adjusting AE (ahead of other itineraries) is that, in general, this will be the most popular itinerary in the cluster.
These three adjustments are not the only choices available to analysts. However, they represent options that stretch across the spectrum of how fully network effects should be considered. Adjustments 1 (conservative, legbased adjustments only) and 2 (aggressive, all potential network effects) are the two extremes. Adjustment 3 (balanced) is a compromise, which is more conservative than Adjustment 2 but still identifies the itinerary most likely to be the source of outlier demand. Further options would be to include more than just the clusterspanning itinerary in an alternative to Adjustment 3, but this leaves another choice of which itineraries to prioritise. As a lower bound, we compute the revenue when no adjustment is made. As an upper bound, we implement an oracle adjustment, i.e. only adjusting the forecasts of affected itineraries. We compare the revenue as the level of outlier demand ranges from 60% to +60% of the average leg demand.
4.3 Experimental results: revenue benefits
Figure 10 shows the revenue generated by outlier demand for each of the three adjustments. We show the results for four of ten itineraries contained within these four legs in Fig. 4. The results for the other six itineraries are similar. Appendix 12 further details these results as well as results on adjustments after outlier detection.
When outlier demand affects all four legs in the cluster (Fig. 10a), any type of adjustment is always better than no adjustment. Besides the oracle, the best choice is Adjustment 3, i.e. the balanced approach, which adjusts the forecasts of the clusterspanning itinerary and the individual leg. Adjustment 3 is able to obtain, on average, 87% of the additional revenue gained under the oracle adjustment. Similar results are obtained when the outlier demand affects three legs (Fig. 10b).
When outlier demand affects only a singleleg itinerary (Fig. 10d), the conservative Adjustment 1 and the oracle adjustment coincide. The aggressive Adjustment 2 yields less revenue than no adjustment. For example although leg AB is correctly adjusted, the erroneous adjustment to itineraries AC, AD, and AE results in incorrect forecasts for legs BC, CD, and DE. The asymmetry between adjustment to positive and negative outlier demand is due to the level of demand being bounded from below by 0. Similar results emerge when the outlier affects only two of the affected legs (Fig. 10c), though the negative consequences of overadjusting all potentially affected itineraries are less severe, as this causes fewer superfluous adjustments.
The negative impact of adjusting unaffected itineraries highlights the importance of correctly clustering legs ahead of outlier detection. The closer the outlier demand itinerary is to the clusterspanning itinerary, the less risky it is to adjust all affected itineraries within a cluster, and the more benefit can be gained from doing so. From a managerial perspective, the best adjustment (other than the oracle) depends on the firm’s objective. To maximise revenue when the most common outlier (e.g. itinerary AE) occurs, the balanced Adjustment 3 is preferable. Conversely, if the objective is to minimise risk to revenue even in the more unlikely scenarios (e.g. an outlier in itinerary AB), conservative Adjustment 1 is preferable. Overall, however, there are clear benefits from forecast adjustment.
5 Empirical study
To demonstrate the practical applicability of the proposed clustering and outlier detection, we apply it to a set of empirical data obtained from Deutsche Bahn. This data set features only bookings of the 2nd class compartment, such that all bookings on one leg require capacity from the same compartment. The Deutsche Bahn longdistance network consists of over 1000 train stations, letting the provider offer more than 110,000 direct origin—destination combinations. The numbers grow further when accounting for alternative transfer itineraries and multiple daily departures. Figure 11 shows the empirical distribution of the number of legs included in itineraries that passengers booked in November 2019. Only 7% of passengers booked singleleg itineraries, whereas almost half of all booked itineraries span five or more legs.
5.1 Clustering legs in the Deutsche Bahn network
5.1.1 Small network subsection
First, we consider a section of the Deutsche Bahn railway network that consists of two intersecting train lines over a total of 27 stations and 28 legs—see Fig. 12. The red train arrives at the connecting stations before the blue train. Hence, the network offers three transfer connections: changing from red to blue at either Fulda, KasselWilhelmshöhe, or Göttingen. This creates 240 potential travel itineraries. For each leg in this network section, Deutsche Bahn records 359 booking patterns for departures between December 2018 and December 2019. Each booking pattern ranges over 19 booking intervals; the first observation occurs 91 days before departure.
We first apply the correlationbased clustering approach of Sect. 2.1, using a threshold of 0.5, such that only legs with a minimum correlation of 0.5 can be in the same cluster. In Fig. 12a, coloured bubbles indicate the four resulting clusters: Each train line splits into one large and one small cluster.
To evaluate clustering on empirical data, where the true underlying demand for each itinerary is unknown, we use the network topology to check whether the resulting clusters are plausible. To that end, we propose the following set of rules:

Different train lines must belong to different clusters. Even when passengers can transfer between lines, we expect relatively few passengers to make the same connection. Further, it makes sense to consider train lines separately for forecasting and analyst interventions.

Train lines are further split into separate clusters on either side of a major station. As many passengers leave the train at a major station and many different passengers board, we shall assume a relatively small proportion of passengers book itineraries that pass a major station. Similarly, given that itinerary demand share is driven by which journeys are most common, and passengers often either board or alight at a major station, it is intuitive to have a cluster that contains the legs between major stations.
Deutsche Bahn assigns an ordinal indicator of importance to each station, ranging from 1 to 7. We define a major station to be in Category 1. The entire Deutsche Bahn network includes 21 major stations, whereas the considered network section includes nine major stations. Figure 12b highlights major stations in grey and shows the clusters resulting from the above rules.
The correlationbased clustering returns four clusters, whereas the rulebased clustering returns nine. Nevertheless, the resulting clusters share similar features. Firstly, the two distinct train lines end up in different clusters in either approach. For legs in distinct train lines, correlation tends to be higher between legs that share a transfer station, but not to a convincing extent—the correlation is at most 0.22. A correlation threshold of 0.27 creates two clusters (one for each train line). Secondly, the breakpoints for the correlationbased approach are a subset of the breakpoints, i.e. major stations, in the rulebased approach. We conclude that the correlationbased approach achieves similar results as the rulebased approach without requiring expert input.
We can formally compare clustering results using the Normalised Mutual Information (NMI) (Amelio and Pizzuti 2015). The NMI is 1 if two clusterings are identical, and 0 if they are completely different.
Figure 13a shows the NMI between the correlation and rulebased approaches while varying the threshold in the correlationbased approach from 0 to 1. This shows that both approaches achieve similar results, with an NMI reaching 0.899. The approaches are generally more similar at higher correlation thresholds (around 0.7) since the rulebased approach generally creates more clusters. Figure 13b compares the number of clusters of the two approaches—as the correlation threshold changes, the number of clusters ranges from 1 (everything in a single cluster) to 28 (each leg in its own cluster), demonstrating the flexibility of the correlationbased approach.
5.1.2 Large network subsection
We extend the empirical study to five train lines to further demonstrate the complexity that considering the network structure brings to clustering and outlier detection, and show the scalability of the approach. The fiveline network consists of 40 stations with 63 legs. As shown in Fig. 14, there are often multiple train lines which cover the same leg or may travel in the opposite direction. As the larger size of the network makes visualisation more difficult, in Fig. 14, stations are represented by circles, with major stations highlighted in black.
Figure 14(a) shows the results of the correlationbased clustering with a default threshold \(\rho = 0.5\). This results in nine clusters, with two train lines each forming their own cluster containing all legs. The breakpoints of the clusters occur at major stations, as also previously seen for two train lines. The pattern of breaking clusters at major stations persists as the correlation threshold is varied. In comparison, the output of the rulebased clustering shown in Fig. 14(b) results in 24 clusters, with many being of size 1.
In these empirical studies, we applied rulebased clustering only to evaluate the plausibility of the results from correlationbased clustering. We do not advocate for it as a method in itself. A rulebased approach, where the clusters are based on domain experts’ categorisations, would not be able to respond to the evolving importance of stations across different train lines and departure times. Notably, the correlationbased method not only uncovers major stations but rather identifies legs where multileg itineraries cause similar booking patterns and thus could change and adapt over time. We further evaluate clustering performance in a simulation study, where the itinerarylevel demand is known, in Appendix 8. The results in the remainder of the paper rely on correlationbased clustering.
5.2 Detecting outliers in the Deutsche Bahn data
Having established clusters, we apply outlier detection independently to each cluster. To exemplify this on empirical data, we apply the outlier detection procedure to a representative fourleg cluster from the Deutsche Bahn network. Applying the proposed outlier detection approach to empirical data cannot precisely judge detection accuracy, given there are no labelled data on genuine outliers. However, this analysis demonstrates the full process of outlier detection on empirical data including, e.g. seasonality and underlines practical implications.
For this analysis, we consider a cluster of four legs from the Deutsche Bahn network with stations anonymised and denoted by A, B, C, D, and E. This cluster results from applying the correlationbased clustering to a new section of the Deutsche Bahn network to Fig. 12.
Figure 15 shows the booking patterns for each of the four legs; bookings are scaled to be between 0 and 1. From initial visual inspection, the structure of the booking patterns appears similar, with some obvious outliers appearing across multiple legs.
To preprocess the data for outlier detection, we transform the booking patterns by applying a functional regression model (Ramsay and Silverman 1997). We then apply the outlier detection to the residual booking patterns. In this preprocessing, we correct for three factors: (i) the departure day of the week; (ii) the departure month of the year; and (iii) the length of the booking horizon.^{Footnote 1}
The functional regression fits a mean function to the booking patterns for each different factor in the model. Table 9 in Appendix 13 compares models including different factors. Let \(y_{nl}(t)\) be the \(n^{th}\) booking pattern for leg l. Then:
where, e.g. \(\mathbbm {1}_{Mon_{nl}} =1\) if departure n relates to a Monday, 0 otherwise. In this model, \(\beta _{0l}(t)\) represents the average bookings for Sunday departures in December, with a regular length of booking horizon, and \(\beta _{pl}(t)\) for \(p>0\) represent deviations from this mean pattern. The \(\beta _{pl}(t)\) are functions of time, which allows for relationships between factors to evolve over the booking horizon. Given that functional depths are calculated independently for each leg, we apply the regression model independently for each leg. The resulting residuals are included in Appendix 14, Fig. 38.
Functional regression preserves the correlation between different legs, as verified in Appendix 19, Table 11b. The clustering approach can consider either the correlations between the booking patterns or the residual booking patterns. Given that the functional depth (the basis for the outlier detection) is calculated on the residuals, we suggest using the correlation between residual patterns to define the clusters. For this data set, the same clusters resulted in either case.
We calculate the functional depth of each booking pattern and compute the threshold as described in Sect. 2.2. We then transform the depths as per equation (2) to obtain \(z_{nl}\), as shown in Fig. 16. The sums of threshold exceedances, \(z_n\), were shown earlier in Fig. 2, with the empirical distribution and fitted generalised Pareto distribution shown in Figs. 3a and 3 b, respectively.
Figure 17 highlights the outliers detected in each leg in pink while depicting outliers detected in other legs but not in that leg in blue. Regular patterns are grey.
Of the 40 outliers (11% of departures) detected across all legs, 23 outliers (almost 60%) could be attributed to known events or holidays. When considering only the top 10 outliers, the percentage rose to 70%. A further departure detected as an outlier had been previously flagged by Deutsche Bahn. The firm implemented a booking stop to control sales on that departure for multiple connected legs. Appendix 18 provides further details on the distribution of identified outliers across legs.
6 Conclusion and outlook
In this paper, we proposed a twostep method for (i) clustering legs in a mobility network that could benefit from joint outlier detection and (ii) detecting outlying demand within such clusters. Furthermore, the proposed method, FD+Agg, ranks identified outliers according to their severity, creating an alert list to aid analysts in prioritising demand forecast adjustments.
The simulation study demonstrated the robustness of the method in a range of outlier demand scenarios. It highlighted that aggregating the analysis across clustered legs improves both detection rate and precision. Further, the ranked alert list often correctly identified the most critical outliers. The advantages of the proposed approach became particularly clear when benchmarking its truepositive rate, distribution of outliers across ranks, and precision, against that from a combination of principal component analysis and highdensity regions (PCA+HDR) from Hyndman et al. (2016), and on the nonranked, legbased method proposed in Rennie et al. (2021).
Furthermore, we implemented a simulated revenue management system to measure the potential revenue benefits of identifying and adjusting for demand outliers in a network setting by applying forecast adjustments across a cluster of legs. This analysis showed that taking into account, the similarity of the legs can improve revenue in most scenarios. In the less likely scenario where only one or two legs of a cluster are affected by outlier demand, riskaverse firms may prefer individual leglevel adjustments.
Finally, by applying the proposed approach to empirical booking data collected by Deutsche Bahn, we demonstrated its applicability and scalability to the type of data observed in practice. In particular, we used this analysis to showcase the expected cluster results and to demonstrate how to account for additional practical considerations, such as trend and seasonality. Note that once the clustering has been performed, the outlier detection can be performed in parallel within each cluster. Therefore, our methodology is scalable to a much larger data set, such as the entire Deutsche Bahn longdistance train network. Such an analysis is not included in this paper as, beyond giving excessive insight into confidential company data, the research insight to be gained from visualising even more complex network cutouts is limited.
The remainder of this section discusses design choices taken in the research documented here, related limitations, and open research challenges.
Leg versus itinerarylevel data: Our proposed method aggregates and analyses booking patterns from legs instead of itineraries based on three considerations. First, when an extensive network features many possible itineraries, most individual itineraries only receive a small share of bookings, challenging any data analysis—the study described in Appendix 11 evaluates such a case. Though the outlier detection may perform well if there are a sufficient number of bookings for a given itinerary, only considering such itineraries risks systematically ignoring outliers from smaller itineraries and feeder legs. Secondly, when offering many potential itineraries, providers rarely store all booking patterns per itinerary. For example capacitybased RM, as described in Strauss et al. (2018), frequently considers leg booking patterns to ensure capacity availability on each leg of a requested itinerary. Accordingly, the methodology proposed here is compatible with capacitybased RM. Finally, even in the idealised case of having large volumes of stored itinerarylevel data for every possible itinerary in the network, then running outlier detection algorithms quickly becomes computationally infeasible as the number of possible itineraries grows rapidly with the size of the network. Detecting outlying clusters of legs, rather than individual itineraries, overcomes all three challenges, as we have demonstrated in this paper. We do, however, note that the outlier detection methodology we propose could be applied directly to itinerary data without performing clustering. However, we only recommend this for densely booked itineraries, as otherwise, zeroinflated data can induce inferior results. We explored this further in Appendix 11.
Constrained versus unconstrained bookings: Observed bookings are constrained by any revenue management controls that were in place at the time of booking, whereas revenue optimisation models rely on unconstrained demand forecasts (Talluri and Van Ryzin 2004, Chapter 9.4). To represent this practice, we analysed constrained bookings in this paper and analysed the effect of adjusting unconstrained forecasts in the computational study. In that vein, further research could also consider the impact of applying the analysis to constrained observations, as showcased here, versus applying it to unconstrained demand estimates, which are frequently used for demand forecasting.
Implications for decision support: Further research is needed to consider the practical aspects of outlier detection from the perspective of decision support. Outliers manifest as changes in arrival rate, price elasticity, or other variables that affect bookings. Outliers can be caused by stochasticity but also by changes in demand patterns as a result of external factors, such as specific events. Complemented by further analysis, successful outlier detection could have three potential uses for RM: 1) detecting outliers early within the booking horizon through online analysis as proposed in Rennie et al. (2021), allowing for rapid interventions; 2) removing any detected outliers from training data for demand forecasting to improve results on predicting reference demand curves; and 3) if outliers can be attributed to specific events, the forecast model could be extended to include such events. Outlier detection can have broader benefits for operational planning in transportation networks, helping service providers to avoid overcrowding and delays. To realise such benefits, future research should particularly focus on effective ways to visualise outliers in networks and to communicate alert lists to planners. To further support analysts in their decisionmaking, additional measures could be included in the alert list. These might include average fare in the affected cluster, potential revenue loss if the outlier is not accounted for, or the outlier severity resulting from running the outlier detection procedure on revenue (instead of booking) patterns. An interesting avenue of further research would be to incorporate a feedback element whereby analysts mark outlier alerts as useful or not useful. A supervised learning approach, e.g. oneclass classifiers, could then be combined with our proposed outlier detection routine to filter out false alerts. Analysts could additionally include feedback on the quality of the clustering approach.
Clustering methodology: Investigating the use of alternative clustering approaches is of interest—especially where the clusters are likely to be of different structures compared to the rail industry, e.g. in the airline industry where hub and spoke networks are more common than lines. Whilst this paper relied on clustering to improve outlier detection, we believe that the clustering approach is a useful contribution in and of itself. For example clustering presents additional research avenues such as its application to improving networklevel forecasting; supporting the planning for future new stations; and evaluating how the transport network structure is changing over time or defining different travel zones. Finally, further research opportunities lie in considering how the success of network outlier detection depends on the network structure. This paper featured examples from transport, specifically railway networks. Other application areas of RM, such as hotels, where correlation is induced by bookings for multiple consecutive nights, feature sparser or structurally different service networks.
Data availability
The simulated data and source code that support the findings of this study are available from the corresponding author upon request. R functions for outlier detection are available at: https://github.com/nrennie/outlierdetectioninnetworkrevenuemanagement. The code also includes some examples of the simulated data. Functions for simulating demand can be found at https://github.com/nrennie/simnetdemand.
Notes
Deutsche Bahn offers a regular booking horizon of 6 months, with the first observation of bookings occurring around 3 months before departure. Due to schedule changes, shorter booking horizons of 3 months apply for departures from midDecember to midMarch.
References
Amelio A, Pizzuti C (2015) Is normalized mutual information a fair measure for comparing community detection methods? Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2015 pp 1584–1585
Azadeh SS, Labib R, Savard G (2013) Railway demand forecasting in revenue management using neural networks. Int J Revenue Manag 7(1):18–36
Banerjee N, Morton A, Akartunalı K (2020) Passenger demand forecasting in scheduled transportation. Eur J Op Res 286(3):797–810
Barrow D, Kourentzes N (2018) The impact of special days in call arrivals forecasting: a neural network approach to modelling special days. Eur J Op Res 264(3):967–977
Belobaba P (1987) Air travel demand and airline seat inventory management. PhD thesis, Massachusetts Institute of Technology
De Baets S, Harvey N (2020) Using judgment to select and adjust forecasts from statistical models. Eur J Op Res 284(3):882–895
Dubin JA, Müller HG (2005) Dynamical correlation for multivariate longitudinal data. J Am Stat Assoc 100:872–881
Fawcett L, Walshaw D (2007) Improved estimation for temporally clustered extremes. Environmetrics 18(2):173–188
Fawzy A, Mokhtar HM, Hegazy O (2013) Outliers detection and classification in wireless sensor networks. Egyptian Inf J 14(2):157–164
Febrero M, Galeano P, GonzálezManteiga W (2008) Outlier detection in functional data by depth measures, with application to identify abnormal NOx levels. Environmetrics 19(4):331–345
Fiig T, Weatherford LR, Wittman MD (2019) Can demand forecast accuracy be linked to airline revenue? J Revenue Pricing Manag 18(4):291–305
Grimshaw SD (1993) Computing maximum likelihood estimates for the generalized pareto distribution. Technometrics 35(2):185–191
He G, Müller HG, Wang JL (2003) Functional canonical analysis for square integrable stochastic processes. J Multivar Anal 85(1):54–77
Hubert M, Claeskens G, De Ketelaere B, et al (2012) A new depthbased approach for detecting outlying curves. In: Colubi A, Fokianos K, GonzalezRodriguez G, et al (eds) Proceedings of COMPSTAT 2012, pp 329–340
Hyndman RJ (1996) Computing and graphing highest density regions. Am Stat 50(2):120–126
Hyndman RJ, Wang E, Laptev N (2016) LargeScale Unusual Time Series Detection. Proceedings  15th IEEE International Conference on Data Mining Workshop, ICDMW 2015 pp 1616–1619. https://doi.org/10.1109/ICDMW.2015.104
Kendall MG (1938) A new measure of rank correlation. Biometrika 30:81–93
Kimms A, MüllerBungart M (2007) Simulation of stochastic demand data streams for network revenue management problems. Or Spectrum 29(1):5–20
Klein R, Koch S, Steinhardt C et al (2020) A review of revenue management: Recent generalizations and advances in industry applications. Eur J Op Res 284(2):397–412
Kumar P, Khani A (2020) Evaluating special event transit demand: a robust principal component analysis approach. IEEE Trans Intell Transp Syst 22(12):7370–7382
Lawrence M, Goodwin P, O’Connor M et al (2006) Judgmental forecasting: a review of progress over the last 25 years. Int J Forecast 22(3):493–518
Leadbetter M (1991) On a basis for ‘Peaks over Threshold’ modeling. Stat Probabil Lett 12(4):357–362
Liang T, Liu H, Tan Y (2020) Research on the gravity planning model of prefecture city rail transit network. International Academic Exchange Conference on Science and Technology Innovation 2019 145(02005):493–518
Pearson K (1895) VII Note on regression and inheritance in the case of two parents. Proc R Soc Lond 58:87
Perera HN, Hurley J, Fahimnia B et al (2019) The human factor in supply chain forecasting: A systematic review. Eur J Op Res 274(2):574–600
Peter SJ, Victor SP (2010) A novel algorithm for informative meta similarity clusters using minimum spanning tree. Int J Computer Sci Inf Secur 8:1
Pickands J (1975) Statistical inference using extreme order statistics. Annals Stat 3(1):119–131
Prim R (1957) Shortest connection networks and some generalizations. Bell Syst Technol J 36:1389–1401
Qu H, Zhou H, Wu Y (2007) Controllable and progressive edge clustering for large networks. Graph Draw Lect Notes Computer Sci 45:4372
Quante R, Meyr H, Fleischmann M (2009) Revenue management and demand fulfillment: matching applications, models, and software. OR Spectrum 31(1)
Ramsay JO, Silverman BW (1997) Functional data analysis. Springer, New York
Ranshous S, Shen S, Koutra D et al (2015) Anomaly detection in dynamic networks: A survey. WIREs: Comput Stat 7(3):223–247
Rennie N, Cleophas C, Sykulski AM et al (2021) Identifying and responding to outlier demand in revenue management. Eur J Op Res 293:1015–1030. https://doi.org/10.1016/j.ejor.2021.01.002
Rennie N, Cleophas C, Sykulski AM, et al (2022) Analysing and visualising bike sharing demand with outliers. https://doi.org/10.48550/ARXIV.2204.06112, http://arxiv.org/abs/2204.06112
Ribatet M, Dutang C (2019) POT: Generalized Pareto Distribution and Peaks Over Threshold. https://CRAN.Rproject.org/package=POT, r package version 1.17
Schaeffer SE (2007) Graph clustering. Computer Sci Rev 1(1):27–64
Schütze C, Cleophas C, Tarafdar M (2020) Revenue management systems as symbiotic analytics systems: insights from a field study. Business Res 13(3):1007–1031
Smith RL (1985) Maximum likelihood estimation in a class of nonregular cases. Biometrika 72(1):67–90
Strauss AK, Klein R, Steinhardt C (2018) A review of choicebased revenue management: theory and methods. Eur J Op Res 271(2):375–387
Talagala PD, Hyndman RJ, SmithMiles K et al (2019) Anomaly detection in streaming nonstationary temporal data. J Comput Gr Stat 29:13
Talluri KT, Van Ryzin GJ (2004) The theory and practice of revenue management. Kluwer Academic Publishers, Amsterdam
Weatherford LR (2016) The history of forecasting models in revenue management. J Revenue Pricing Manag 15(3):212–221
Weatherford LR, Belobaba PP (2002) Revenue impacts of fare input and demand forecast accuracy in airline yield management. J Op Res Soc 53(8):811–821
Weatherford LR, Bodily SE (1992) A taxonomy and research overview of perishableasset revenue management: yield management, overbooking, and pricing. Op Res 40:831–844
Weatherford LR, Kimes SE (2003) A comparison of forecasting methods for hotel revenue management. Int J Forecast 19(3):401–415
Yang K, Shahabi C (2004) A PCAbased similarity measure for multivariate time series. MMDB 2004: Proceedings of the Second ACM International Workshop on Multimedia Databases pp 65–74. https://doi.org/10.1145/1032604.1032616
Zahn CT (1971) Graphtheoretical methods for detecting and describing gestalt clusters. IEEE Trans Computers C–20(1):68–86
Acknowledgements
We gratefully acknowledge the support of the EP/L015692/1 STORi Centre for Doctoral Training funded by the Engineering and Physical Sciences Research Council. The authors thank Deutsche Bahn for the provision of data and are particularly grateful to Philipp Bartke and Valentin Wagner for helpful discussions and suggestions.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Additional details of method
Appendix A provides additional details on the proposed method described in Sect. 2, including the specifics of the correlationbased minimum spanning tree clustering, and the calculation of the functional depths.
1.1 A.1 Functional dynamical correlation
Let \(y_{n,ij}(t)\) be the total observed bookings for the \(n^{th}\) departure on leg ij up to booking interval t, and similarly for \(y_{n,jk}(t)\). The functional dynamical correlation between the booking patterns \(y_{n,ij}(t)\) and \(y_{n,jk}(t)\) is:
where
and w(t) is a weight function that accounts for the time gap between observations. Here, \(y^{*}_{n,ij}(t)\) is a standardised version of \(y_{n,ij}(t)\):
where \(\mu _{ij}(t)\) is a mean function, and:
The functional dynamical correlation is then the average across all N departures:
1.2 A.2 Prim’s algorithm
Prim’s algorithm is a greedy algorithm with the following basic steps. Assuming the original graph G has V(G) vertices.

Initialise the MST, T, with the edge with minimum weight and the two vertices it connects. Let V(T) be the number of edges in T.

While \(V(T) < V(G)\):

go through the remaining edges in G in order from smallest to largest weights, until one is found that is connected to T, but does not form a circuit (i.e. the edge does not form a loop such that T is no longer a tree).

Add this edge (and the vertices it connects) to T.

More computationally efficient algorithms exist but given the reasonable size of the graphs considered, and more specifically their sparsity (very few stations are adjacent), the computational time is reasonable using Prim’s algorithm.
1.3 A.3 Functional depth
The functional halfspace depth is given by:
where, using \(t_{\tau +1} = t_{\tau } + 0.5(t_{\tau }  t_{\tau 1})\), the weights \(w_{\alpha }(t_j)\) are, according to Hubert et al. (2012):
where \(\alpha \in (0,0.5]\), with a default value of \(\alpha = 1/{T}\). The sample halfspace depth of a \(K\)variate vector \(x\) at time \(t_j\) is given by (Hubert et al. 2012):
1.4 A.4 Normalised mutual information
For a graph containing M legs, the mutual information between two clusterings \({\mathcal {A}}\) and \({\mathcal {B}}\) of the M nodes in the inverted graph is defined as:
where \(M_a\) is the number of nodes in the \(a^{th}\) cluster of clustering \({\mathcal {A}}\), and similarly for \(M_b\). The normalised mutual information (NMI) between two clusterings is defined as (Amelio and Pizzuti 2015):
where \(H({\mathcal {A}})\) is the entropy (a measure of uncertainty) defined as:
\(NMI({\mathcal {A}},{\mathcal {B}})\) = 1 if \({\mathcal {A}}\) and \({\mathcal {B}}\) are identical, and 0 if they are completely different.
Appendix B Details of computational study
Appendix B contains additional details of the simulation setup described in Sect. 3, including the computation of the bid prices, and a validation of the chosen parameter values.
1.1 B.1 Dynamic programming for bid price control
From Talluri and Van Ryzin (2004), let x be the remaining capacity, and define \(V_{t}(x)\) denote the value function at time t. Define R(t):
where \(r_j\) denotes the revenue from accepting a request for fare class j. The probability that \(R(t) = r_j\) is equal to the arrival rate for fare class j at time t. Note the arrival rates are such that at most one request arrives in each time period. Define:
We wish to maximise the combined revenue in the current time period, and the revenue to come in future time periods:
The Bellman equation for \(V_t(x)\) is:
where \(\lambda _j(t)\) is the arrival rate of demand for fare class j in interval t:
and \(\Delta V_{t+1}(x) = V_{t+1}(x)  V_{t+1}(x1)\) is the marginal cost of capacity in the next time period. The problem is solved with backwards recursion, with the following boundary conditions apply:
These ensure that (i) no revenue can be generated beyond the booking horizon, i.e after departure; and (ii) that no further revenue can be generated if there is no capacity remaining. The bid price at time t with remaining capacity x is given by \(\Delta V_{t}(x)\).
1.2 B.2 Details of benchmark method
We use the method proposed by Hyndman et al. (2016) as a benchmark comparison for our proposed method in Sect. 3. The method works as follows:

Define the total demand booking patterns as the sum of the demand for each leg within the cluster.

Compute f features of the n total demand booking patterns. Features include mean, variance, firstorder autocorrelation, trend, linearity, seasonality, peak, trough, entropy, lumpiness, spikiness, change in variance, and Kullback–Leibler score, among others. See Hyndman et al. (2016) for a full list.

Apply principal component analysis (PCA) as per Yang and Shahabi (2004) to determine the first two principle components, i.e. those that explain the most variance.

Use a densitybased multidimensional approach (Hyndman 1996) to find points in the first two principal components with the lowest density.

The nu points with the lowest densities relate to the departures which are classified as outliers.
1.3 B.3 Parameter values for simulation study
The parameter valued used to generate the demand in the computational study are outlined below.
1.3.1 B.3.1 Outliers considered in computational study
Table 3 shows the different experiments that were carried out as part of the computational study. We consider cluster outliers in which every itinerary within the cluster is equally affected; itinerary outliers where only a single itinerary within the cluster is affected; and station outliers which affect all itineraries that end at a particular station.
Appendix C Computational results
Appendix C includes the extended results from the computational study described in Sect. 3. Results from additional simulation experiments to test the proposed clustering approach are also presented here.
1.1 C.1 Evaluation of network clustering
For the correlationbased clustering to perform well it needs to (i) accurately estimate similarity between adjacent legs and (ii) use information about the pairwise similarity between adjacent legs to detect similarity between (potentially) more than two legs to form clusters. We use the proportion of total demand belonging to each itinerary to determine a clustering benchmark. For example in Fig. 18a, when all passengers travel the itinerary from A to E, the resulting bookings in each of the four legs would be identical. In this case, the correlation between legs would be 1—giving a single cluster of four legs.
To evaluate the clustering when the underlying demand is known, we define the common traffic ratio between two adjacent legs as the proportion of total demand that relates to itineraries over both legs. That is, for two legs ij and jk, we define the common traffic ratio, r(ij, jk), to be:
where \(D_{ij}\) is the demand for itinerary ij, and \(D_{ik}\) is the total demand for all itineraries which include both legs ij and jk. If all passengers book itineraries that traverse both legs, then \(r(ij,jk) = 1\). Conversely, if no passengers book journeys that traverse both legs, then \(r(ij,jk) = 0\).
We vary the level of demand for each itinerary to generate different benchmark clusterings. The output of the correlationbased clustering is then compared with benchmark clustering using the NMI. We consider three cases: The four legs belong in a single cluster (Fig. 18a); they belong in two clusters (Fig. 18b); and they belong in four clusters (Fig. 18c).

Case 1: When itinerary AE accounts for at least 50% of the network demand, we expect legs AB, BC, CD, and DE to belong to the same cluster, as they experience mostly the same demand. The remaining demand is calibrated across itineraries such that the total demand for each leg is reasonably uniformly distributed. We compare the correlationbased clustering with the benchmark clustering of all four legs in a single cluster, when the average percentage of demand on each leg from itinerary AE is 50%, 60%, 70%, 80%, 90%, or 100%. Figure 19a shows the fraction of total demand on each leg, from each itinerary, in the case where 60% of demand is for itinerary AE.

Case 2: We calibrate the majority of demand on leg AB and BC to be for itinerary AC, and the majority of demand on legs CD and DE to be demand for itinerary CE. For simplicity, the distribution of demand is symmetric across the four legs. We compare the performance when the average percentage of demand on each leg belonging to the clustering benchmark itinerary is 50%, 60%, 70%, 80%, 90%, or 100%. Figure 19b shows the case where 60% of demand on each leg is for the respective cluster itineraries (AC or CE).

Case 3: We calibrate the majority of demand on leg AB for itinerary AB, the majority of demand on leg BC for itinerary BC, and so on. We compare the performance when the average percentage of demand on each leg belonging to the leg itinerary is 50%, 60%, 70%, 80%, 90%, or 100%. Figure 19c shows the case where 60% of demand on each leg is for the itinerary consisting of only that leg.
The results are shown in Table 4.
In almost all cases, the normalised mutual information between the correlationbased clustering and the benchmark equals 1, indicating congruence. We now extend the simulation study by comparing the output of the correlationbased clustering under different correlation measures. In additional to the functional dynamical correlation measure described in Sect. 2.1, we compare Pearson correlation (Pearson 1895) and Kendall rank correlation (Kendall 1938). Let \(y_{n,ij}(t)\) be the observed bookings for the \(n^{th}\) departure on leg ij, and \(y_{n,pq}(t)\) analogous for leg pq.

Pearson correlation: calculate the Pearson correlation between corresponding booking patterns, then average across all booking patterns. That is, for the \(n^{th}\) of N booking patterns observed over T booking intervals, we calculate the Pearson correlation coefficient as:
$$\begin{aligned} \rho _n(ij, pq) = \frac{\sum _{t=1}^{T}(y_{n,ij}(t)  \mathop {\overline{y_{n,ij}}})(y_{n,pq}(t)  \mathop {\overline{y_{n,pq}}})}{\sqrt{\sum _{t=1}^{T}(y_{n,ij}(t)  \mathop {\overline{y_{n,ij}}})^2}\sqrt{\sum _{t=1}^{T}(y_{n,pq}(t)  \mathop {\overline{y_{n,pq}}})^2}} \end{aligned}$$(34)where \(\mathop {\overline{y_{n,ij}}}\) is the mean number of bookings for the \(n^{th}\) booking pattern. Then:
$$\begin{aligned} \rho (ij, pq) = \frac{1}{n}\sum _{n=1}^{N} \rho _n(ij, pq). \end{aligned}$$(35) 
Kendall rank correlation: observations \((y_{n,ij}(s), y_{n,pq}(s))\) and \((y_{n,ij}(t), y_{n,pq}(t))\) where \(s < t\), are concordant if their ordering agrees, and discordant otherwise. The Kendall rank correlation is defined between the \(n^{th}\) booking patterns in legs ij and pq as:
$$\begin{aligned} \rho _n(ij, pq) = \frac{t_c  t_d}{\sqrt{(}t_0  t_1)(t_0t_2)} \end{aligned}$$(36)where \(t_c\) is the number of concordant pairs, \(t_d\) is the number of discordant pairs, and \(t_0\), \(t_1\), and \(t_2\) are defined as follows:
$$\begin{aligned} t_0= & {} \frac{T(T1)}{2}, \end{aligned}$$(37)$$\begin{aligned} t_1= & {} \sum _{s} u_s(u_s  1) /2, \end{aligned}$$(38)$$\begin{aligned} t_2= & {} \sum _{t} v_t(v_t  1) /2, \end{aligned}$$(39)where \(u_s\) is the number of tied values in the \(s^{th}\) group of ties for in booking patterns for leg ij, and \(v_t\) is analogous for leg pq. Then:
$$\begin{aligned} \rho (ij, pq) = \frac{1}{n}\sum _{n=1}^{N} \rho _n(ij, pq). \end{aligned}$$(40)
We compare the cases where the correlation measure is (i) applied directly to the booking patterns and (ii) applied to the booking patterns where the withinbooking pattern relationships, e.g. trends have been removed. The normalised mutual information between the clustering produced by the correlationbased clustering under each of the different correlation measures and the benchmark clustering is shown in Table 5.
For case 1, all three correlation measures seem to be performing equally well, with the normalised mutual information almost always indicating congruence. For cases 2 and 3, the Pearson and Kendall correlation results in extremely poor performance in terms of NMI, with the benchmark clustering never being achieved. Functional dynamical correlation, however, continues to perform well with an NMI close to 1.
In order to determine why the Pearson and Kendall rank correlations initially appear to perform well in the single cluster case but fail in the two cluster case, we also compare the value of the correlation coefficient with the known demand share in a simple twoleg example. Consider the simple twoleg network shown in Fig. 20.
The common traffic ratio of legs AB and BC is:
If \(r(AB,BC) =1\), then the number of bookings on leg AB and leg BC is identical, and the correlation between them is 1. Conversely, if \(r(AB,BC) =0\), then the bookings on leg AB and leg BC are independent with correlation 0. Table 6 shows the estimates of the correlation, compared to the true ratio, r(AB, BC).
Functional dynamical correlation applied directly to the data performs best in all cases. In case 1, where the benchmark clustering is a single cluster, poor clustering performance can only result from underestimating the demand share. Both Pearson and Kendall rank correlation overestimate the correlation between booking patterns, even when the withinbooking pattern effects have been removed. This explains the good performance of Pearson and Kendall rank correlation in case 1, despite extremely poor performance in cases 2 and 3.
1.2 C.2 Detecting outliers in multiple legs
1.2.1 C.2.1 Results on detecting outliers in multiple legs
Table 7 provides the results shown in Fig. 5 in tabular format.
1.2.2 C.2.2 Outlier detection under alternative thresholds
We recognise that the percentage of departures that analysts are able to adjust strongly depends on the ratio of analysts to departures and that this is likely to be domaindependent. Therefore, here, we consider outlier detection performance as the functional depth threshold varies.
In terms of truepositive rates, the threshold choices of 0.01, 0.05, or 0.1 produce similar results, at least near the top of the alert list. Our method ranks the departures classified as outliers such that genuine outliers are more likely to be at the top of the ranked list, and false positives at the bottom of the list. Therefore, using a higher threshold tends to add more departures to the bottom of the list, and increase the risk of more false positives. As shown in Section 4 of the manuscript, the outliers that a threshold of 0.01 fails to detect tend to be small changes in magnitude. It is these small magnitude outliers that are added to the bottom of the list as the threshold increases. Notably, a threshold of 0.001 results in reduced performance even at the top of the list, suggesting that this would be too low a threshold. Similar results are seen in the change in precision (compared to the nonranked method with the same threshold).
A higher threshold does result in higher overall truepositive rates as more departures are classified as outliers. However, the maximum truepositive rate for a threshold of 0.05 results in around 1 in 5 departures being classified as outliers. This is quite a high percentage for them all to be considered outliers ( Fig. 21).
1.2.3 C.2.3 Distribution of outliers across multiple legs
In the scenario where all itineraries are equally affected, a high proportion of outliers should be detected in more than one leg. Figure 22a illustrates the proportion of outliers detected in 1, 2, 3, or 4 legs: More than half were detected in multiple legs. Figure 22b shows the proportion of true positives (genuine outliers which were detected), by the number of legs in which they were detected. In contrast with Fig. 22a, a much higher percentage of genuine outliers are detected in all four legs.
Given the clustering is correct, we expect an approximately equal number of singleleg outliers in each leg, as shown in Fig. 23b. If one leg, say DE, had not belonged in this cluster, we would expect a higher proportion of singleleg outliers to have been detected in leg DE. This could be utilised as a method for checking the clustering, after the outlier detection.
These results motivate aggregating threshold exceedances across legs in two ways: (i) Since less than 100% of genuine outliers were detected in all legs, if outlier detection was carried out only on the leg level, outliers could be missed on some legs. (ii) Given that a much higher proportion of outliers detected in four legs were genuine outliers, by ranking booking patterns detected in all legs as more likely to be outliers, we focus analysts’ attention on those more likely to be genuine outliers.
1.2.4 C.2.4 False discovery rate
The false discovery rate (FDR) is defined as the proportion of booking patterns classified as outliers which were false positives:
See Sect. 3.4 for definitions of true and false positives. Figure 24 shows the FDR for the case where outlier demand affects all itineraries, and the magnitude is randomly chosen from each of the distributions described in Sect. 3.2.
Figure 25 shows the FDR for each of the magnitudes of outliers considered in the simulation study. Given that smaller magnitude outliers are more similar to the regular demand, these result in higher false discovery rates.
1.2.5 C.2.5 Outliers affecting a single itinerary
Figure 26 shows the truepositive rate for the remaining itineraries in Fig. 7 of Sect. 3.4.
1.2.6 C.2.6 Outliers affecting a subset of itineraries
We consider a case where demand outliers affect only a subset of itineraries. Practical examples for this phenomenon could include trade fairs or conventions as well as regional crises. In such situations, demand towards (or from) a specific destination is most affected. Here, clustering offers additional benefits in guiding analysts towards those itineraries where they should adjust the forecast or controls.
We differentiate four scenarios based on the fourleg network described in Sect. 3, where events affect demand for itineraries travelling to stations B, C, D, and E, respectively. We expect analogous results when customers aim to travel home from events that happened at stations A, B, C, or D, respectively, given the symmetry of the demand parameters chosen for the computational study.
For each of the four possible events considered, we investigate the case where this generates 50% increase in average leg demand. For simplicity, we assume that these passengers are equally split between the itineraries which alight at the relevant station. Table 8 shows the resulting demand increases for each leg.
Figure 27a shows the truepositive rate for each of the cases. Although the event at E generates outliers in more legs, it is not the case that it has the highest truepositive rate. This shows that though the approach aggregates across legs, it does not ignore outliers only in a subset of those legs, provided they are sufficiently large. These effects may also be caused by interactions between the booking limits on different legs. For example in the case of an event at C, large increases in demand in legs AB and BC may cause booking limits to be reached earlier for these legs, which also limits bookings in itineraries such as AD and AE. Hence, an increase in demand for some legs may cause a decrease in bookings for different legs. By jointly considering multiple legs for outlier detection, we are able to detect the knockon effects of outliers even when the change in demand only affects a subset of legs. The change in precision can be interpreted similarly, in Fig. 27b.
Had we considered outlier detection on a legbyleg basis, the outliers were more likely to be missed in some of the legs. By combining information across legs, we are better able to determine which itineraries are affecting the volume of demand.
1.2.7 C.2.7 Limit alert list length via outlier severity thresholds
The results in this paper focus on limiting the length of the ranked alert list simply by the number of alerts it contains as this is most relevant to analysts. However, an alternative approach limits the length of the list by the outlier severity assigned to each departure. For example classifying a train as an outlier only if its outlier severity is above 80%.
Detection results when outliers affect all itineraries
Figure 28 shows the truepositive rate as the outlier severity decreases from 100% to 0%. Results are similar to those shown in Fig. 5a. Figure 29 shows the truepositive rate as the outlier severity decreases from 100% to 0%, for each magnitude of outlier considered. Results are similar to those shown in Fig. 8. Figure 30 indicates the truepositive rate for outliers that affect demand for a single itinerary.
Detection results when outliers affect a single itinerary
1.3 C.3 Detection results when demand is seasonal
In this section, we consider the case where demand is nonstationary across departures. In order to make the results of this simulation study more comparable with the other simulation studies included in the paper, we simulate 10 seasonal groups (somewhat analogous to months) as shown in Fig. 31. We then take 50 'days' from each of the 10 groups to produce 500 booking patterns per simulation, as in the other simulation experiments. All other parameters remain fixed in the simulation.
We then fit a functional regression model (similar to the model described in Sect. 5.2 for the empirical study) to remove the seasonality before performing the outlier detection. The quality of the outlier detection will highly depend on the model used to model the seasonal (and trend) components of the demand. A full study of the most suitable approach for modelling this type of demand is outwith the scope of this paper.
In comparison with the nonseasonal demand model, the truepositive rate is slightly lower. The false discovery rate is also slightly lower—see Fig. 32. Due to the seasonal variation the overall variance of the demand is higher which causes a lower functional depth threshold (for the same choice of percentile), meaning fewer observations are fall below the threshold.
1.4 C.4 Detection results using itinerarylevel data
The methodology described in this paper focuses on the application to leglevel booking data—since in capacitybased RM only leglevel capacities are required to be stored. Often, the itinerarylevel data are never recorded. However, if itinerarylevel data are available for analysis, one could apply the same approach (either with or without the aggregation step). The results of doing so are presented in this section.
Figure 33 shows the truepositive and false discovery rates from the alert lists generated for each itinerary. It shows that the outlier detection performs well in the most popular itinerary (the itinerary that defines the cluster—A\(\rightarrow\)E). The results are comparable to running the outlier detection procedure on the leglevel data. In the remaining nine itineraries with far fewer bookings, the outlier detection performs poorly, as there are too few bookings to detect a pattern.
Figure 34 shows that when the itinerarylevel outlier detection results are aggregated across itineraries, the overall performance is poorer than the leglevel aggregation. These results suggest that for itinerarylevel outlier detection, the better approach would be to identify important itineraries and run the outlier detection routines on those without aggregation. This, however, means that outliers that occur only in a small part of the network (which can cause knockon effects) would be systematically overlooked. To consider the full network, the legbased approach that accounts for itineraries through clustering performs better.
To further illustrate that the issue with itinerarylevel analysis is primarily caused by an insufficient number of bookings for lesspopular itineraries (making it difficult to detect patterns in the demand), we analyse the performance as the regular itinerarylevel demand varies. Figure 35 shows that performance (especially false discovery rate) is poorer when demand is especially low.
1.5 C.5 Revenue benefits from forecast adjustments
Figure 36 shows the truepositive rate for the remaining itineraries in Fig. 10 of Sect. 4.3.
The analysis in Sect. 4.2 constitutes a bestcase scenario in which we assume that, if outlier demand affects a particular leg, the outlier is detected in that leg. However, as we show in Sect. 3.4, even when demand outliers affect multiple legs, the outlier is not always detected in every leg due to noise. Therefore, we additionally compare different adjustments based on the output of the outlier detection, for an outlier in itinerary AE.

Adjustment A: Adjust only the forecasts of the affected singleleg itineraries for those legs in which the outlier is detected.

Adjustment B: Adjust the forecasts of the affected singleleg itineraries for those legs in which the outlier is detected, and the cluster spanning itinerary (AE).
We compare these both to making no adjustment and to the oracle adjustment. This is still a bestcase scenario to some extent, given that we assume the correct magnitude of adjustment is made.
Figure 37 shows the revenue under adjustments A and B (as described in Sect. 4.2) depending on the output of the outlier detection procedure. Combining adjustments on the leg level with those on the cluster level provides superior results in contrast to leglevel adjustments alone. Though making adjustments to only the singleleg itineraries may be riskaverse in the rare cases where an outlier affects only a small subset of the legs within a cluster, it may be detrimental to revenue when outliers affect multiple legs.
Appendix D Deutsche Bahn booking data
Appendix D contains an additional analysis of the empirical booking data from Deutsche Bahn.
1.1 D.1 Model selection for functional regression
Due to the functional nature of the data, in order to determine which of the factors result in a better fitting model, we use the CrossValidated Sum Of Integrated Squared Errors (CVSSE).
where \(\hat{y_{nl}(t)}\) is the prediction for the \(n^{th}\) booking pattern on the leg l, under the model fitted to all but the \(n^{th}\) booking pattern. The model which produces the lowest CVSSE is chosen as the best fitting. Note that unlike other model selection criteria (e.g. AIC), CVSSE does not take into account the number of parameters. Given that we are not interested in outofsample prediction, only in obtaining the best fitting model for our data, overfitting is not of great concern. The values of the CVSSE for each of the 12 models considered are shown in Table 9.
Across all legs, we find that day, month, and shortened booking horizons are all factors that must be taken into account. The inclusion of the days of the week as factors significantly reduces the CVSSE. In comparison, the inclusion of the booking horizon variable has a smaller, though still positive, effect. We compare two different approaches to accounting for the shortened booking horizon: (i) an indicator function (I) equal to 1 if the booking horizon is shorter, and (ii) a continuous variable (C) between 0 and 1 which gives the length of the shortened horizon as a proportion of the regular length horizon. Based on the CVSSE scores, shortened booking horizons are best represented by the indicator function, i.e. it is important to know that it is shorter but not by how much. The smaller effect of the horizon length variable may be related to the inclusion of the month variable, which is unsurprising given the overlap in the definition of these variables. The values of the CVSSE are similar for models 2 and 7, where we only consider one month or horizon length as a factor.
1.2 D.2 Residual booking patterns
Figure 38 shows the residual booking patterns resulting from the functional regression applied in equation (12) of Sect. 5.2. Compare with Fig. 15 of Sect. 5.2—the obvious outliers are preserved.
1.3 D.3 Functional linear regression coefficients
Table 10 shows an example of the coefficients for each explanatory variable over the 19 booking intervals, along with their standard errors. Not all days of the week, or months of the year, are significant throughout the entire booking horizon. Similarly, some days of the week have much stronger significance than others, e.g. Saturdays have a much larger impact than Tuesdays.
1.4 D.4 Functional depths
Figure 39 shows the functional depths for the empirical residual booking patterns, before the functional depths are transformed into the \(z_{nl}\), as shown in Fig. 16 of Sect. 5.2.
1.5 D.5 Probability plots for GPD and exponential distributions
Given that, if both \(\mu = 0\) and \(\xi = 0\), the GPD reduces to an exponential distribution, it is appropriate to compare the fit of the GPD with an exponential distribution to check if the inclusion of additional parameters is beneficial. Figure 40 shows the PP plots, i.e. the fitted theoretical CDF against the empirical CDF for the GPD (Fig. 40a) and the exponential distribution (Fig. 40b). The GPD provides a closer fit to the empirical data and the additional parameters better account for the shape of the distribution.
The GPD does not provide a perfect fit, with the probabilities in the bottom left of Fig. 40a on consistently being underestimated. However, given that we assume points with very low probability are more likely to be false positives, underestimating may actually be beneficial. Further, only the highlyranked outliers, i.e. those with high probability, are likely to be considered by an analyst due to time constraints. The GPD provides a very good fit for those data points. If there are a sufficiently large number of threshold exceedances, an empirical distribution could alternatively be used to compute the probabilities.
1.6 D.6 Distribution of outliers across multiple legs
The proportion of outliers found in each number of legs is shown in Fig. 41, with over half of the outliers detected in multiple legs. Compared with Fig. 22, this shows a similar proportion of outliers as found in the simulation study.
Figure 42a shows the proportion of total outlying booking patterns in terms of which legs they were detected as outliers in. Figure 42b shows the proportion in each leg of outlying booking patterns detected in one leg only. The proportions are fairly evenly split between the different legs. This reassures us that the correct clustering was chosen—if leg DE did in fact belong to a separate second cluster, we would expect a higher proportion of singleleg outliers to have been found in leg DE—compare with Fig. 23.
1.7 D.7 Simulation verification
In order to validate the parameter choices used to simulate booking patterns, we compare the resulting simulated booking patterns with the empirical booking patterns. We consider the standard deviation and mean of the bookings across the booking horizon of each in Fig. 43. Both the empirical and simulated booking patterns show a similar shape and magnitude of the relationship between the mean and standard deviation across the booking horizon.
We also compare the correlations between the different legs for both the empirical and simulated data. Table 11 shows the functional dynamical correlation between the empirical booking patterns and empirical residual booking patterns, for each leg. Table 12 shows the corresponding correlations between the simulated booking patterns. The values are similar and the rate of decay between legs as they get further apart follows a similar pattern.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Rennie, N., Cleophas, C., Sykulski, A.M. et al. Outlier detection in network revenue management. OR Spectrum 46, 445–511 (2024). https://doi.org/10.1007/s00291023007142
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00291023007142