Outlier detection in network revenue management

This paper presents an automated approach for providing ranked lists of outliers in observed demand to support analysts in network revenue management. Such network revenue management, e.g. for railway itineraries, needs accurate demand forecasts. However, demand outliers across or in parts of a network complicate accurate demand forecasting, and the network structure makes such demand outliers hard to detect. We propose a two-step approach combining clustering with functional outlier detection to identify outlying demand from network bookings observed on the leg level. The first step clusters legs to appropriately partition and pool booking patterns. The second step identifies outliers within each cluster and uses a novel aggregation method across legs to create a ranked alert list of affected instances. Our method outperforms analyses that consider leg data without regard for network implications and offers a computationally efficient alternative to storing and analysing all data on the itinerary level, especially in highly-connected networks where most customers book multi-leg products. A simulation study demonstrates the robustness of the approach and quantifies the potential revenue benefits from adjusting demand forecasts for offer optimisation. Finally, we illustrate the applicability based on empirical data obtained from Deutsche Bahn.


Introduction and State of the Art
Revenue management (RM) is a challenging task for network service providers.
The concept entails controlling the set of product offers over a fixed sales horizon such that, given the predicted demand for the offers, the expected revenue from selling a limited capacity is maximal. To thus maximise revenue, the firm has to forecast the expected demand for all products that require capacity on mutual resources. Examples include transport itineraries that cross several network legs, and hospitality offers that combine room availability for multiple nights.
Several existing contributions, e.g. Weatherford and Belobaba (2002) and Rennie et al. (2021), demonstrate the negative effects of inaccurate demand forecasts on revenue performance but neglect network effects. Motivated by this, we propose a new approach to detect outliers in network bookings, thereby supporting forecast corrections for improved network revenue management.

Terminology
For simplicity, we employ a transport-based terminology throughout this paper: a leg describes a direct non-stop connection between two stations in a network, and an itinerary is any combination of legs that can be jointly booked as one product. A departure describes a journey along a connected series of legs that leaves the origin station at a unique time and date.
We denote the accumulation of bookings across the sales horizon as a booking pattern. We define an outlier as a booking pattern resulting from short-term systematic demand changes for one or several related network itineraries. These outliers occur when demand deviates from the baseline due to unforeseen events. For example, demand increases for a specific destination affect the entire itinerary. In consequence, deviations in the booking patterns are observable on the legs arriving at that destination and in the feeder legs.
Capacity-based revenue management differentiates offers through fare classes.
Fare classes describe combinations of fares and tariffs at which the firm offers a product. Customers booking a ticket for a specific departure may choose from several offered fare classes. For instance, the cheapest offer could be fare class 'M', costing 20 Euros and entailing a no-refund tariff.

Existing Work
RM is a well-studied problem for many different products and services (Talluri and Van Ryzin, 2004). Still, only recently the specific issues around network services and demand forecasting have come into focus. E.g. Klein et al. (2020) review how single-leg practices to RM generalise to the network setting. Weatherford (2016) surveys RM forecasting methods and focuses on airline itinerary-level forecasting.
So far, few authors have examined demand outliers in RM data. For historical hotel booking data, Weatherford and Kimes (2003) discuss a simple method of removing observations that are more than ±3σ away from the mean. Rennie et al. (2021) apply functional analysis to detect outliers on individual legs. Neither, however, consider outliers affecting multiple legs of a network. In Azadeh et al. (2013), the authors identify outliers in network railway bookings via a simple rule to remove them before forecasting future demand. For a slightly different perspective, Kumar and Khani (2020) analyse transit demand for outliers to detect special events. Notably, existing research on outlier detection frequently focuses on binary outlier detection without regard for quantifying how critical an outlier is.
Practical network RM relies on manual forecast adjustments (Quante et al., 2009;Schütze et al., 2020). Previous research has shown that the resulting judgemental forecasts can be biased and even superfluous (Lawrence et al., 2006;De Baets and Harvey, 2020). Perera et al. (2019) note that forecasting support tools can improve user judgement by reducing complexity for the analyst. Analysts' time is limited, so they cannot investigate every departure flagged as an outlier. For example, Deutsche Bahn experts estimate that they can reasonably adjust less than 1% of forecasts. Therefore, ranking outliers by criticality is crucial.
Beyond RM, Barrow and Kourentzes (2018) propose a functional approach for outlier detection in call arrival forecasting without regard for network effects. General outlier detection in networks often focuses on identifying outlying parts of the network. Fawzy et al. (2013) use this approach in wireless sensor networks to find faulty nodes. Ranshous et al. (2015) consider the extension to identify outlying nodes when the network changes over time. Most research on dynamic networks concentrates on analysing a single time series connected to each node rather than a set of time series, as required when booking patterns are reported for multiple departures. Hyndman et al. (2016) note that the problem of identifying unusual time series within a collection is not as extensively studied as other outlier detection problems. In this paper, we benchmark the approach suggested by Hyndman et al. (2016), which employs principal component analysis (PCA), against our newly proposed approach.

Contribution
We shall study booking patterns that result when customers book not just a single resource (leg) but network products that require multiple resources (itineraries).
Such booking patterns may be reported on the leg or the itinerary level -in this paper we assume that they are reported per leg and departure. This applies in the case of Deutsche Bahn, which serves as a motivation and empirical demonstration for the work presented here.
Network effects challenge outlier detection in two ways: On the one hand, demand outliers on the itinerary level affect bookings on all legs included in the itinerary. On the other hand, such outliers may not be recognisable when only considering leg bookings independently, given the noise from other itineraries overlapping those legs. As a result, directly extracting outliers from booking data collected in an entire, realistically sized network is likely an intractable problem. To circumvent this problem, in this paper, we aggregate and analyse booking patterns from legs instead of itineraries, as this allows for computationally and statistically tractable network-wide outlier detection. In Section 6, we further discuss the choice of leg-level vs itinerary-level-based analysis and point out how our procedures could be adapted to itinerary-level outlier detection.
Our network outlier detection procedure: (i) clusters legs with similar booking patterns and (ii) detects joint outliers within each cluster to compile ranked alert lists of outlying departures and affected legs. Our methodology significantly improves outlier detection performance in a network setting versus alternative methods.
In more detail, our proposed approach first clusters legs by measuring the similarity of booking patterns via functional dynamical correlation (Dubin and Müller, 2005). We suggest this measure for its freedom from restrictive assumptions. As the proposed approach is modular, other correlation measures could be used for the same end. In the second step, the proposed approach detects outliers from booking patterns within each cluster by combining the functional data analysis methods of Febrero et al. (2008); Hubert et al. (2012); Rennie et al. (2021) with a novel withincluster aggregation, which generates a ranked alert list of outliers using extreme value theory. This alert list can help analysts to identify the need for further analysis and adjustments. We consider an outlier as more critical if it indicates a larger demand shift and if it is identified across multiple legs. Factors such as the average fare on legs where outliers are detected, or the revenue at risk from faulty forecasts could also be incorporated into the definition of an outlier's criticality.
Finally, analysts have several choices when tasked with forecast adjustment for network services. The best choice is not obvious, and we further quantify the impact of different potential adjustments on revenue in a simulation study, following concepts outlined in Kimms and Müller-Bungart (2007).
In summary, this paper contributes (i) a method for identifying network legs that will benefit from joint outlier detection and (ii) a method to aggregate outlier detection across any number of legs to create a ranked alert list. To thoroughly evaluate the proposed approach, we offer (iii) wide-ranging simulation studies to benchmark the method's outlier detection performance against and to quantify the potential revenue improvements from forecast adjustments and (iv) a demonstration of applicability on empirical railway booking data from Deutsche Bahn.

Method
Several network products may rely on common resources when demand concerns multiple legs at once -in the transport example, even passengers that booked different itineraries often have to traverse the same legs. Therefore, specific legs share common outliers, as, for example, a sudden increase in demand from passengers travelling from one end of the network to attend an event at the other end would increase demand for each of the in-between legs. Neither considering each leg independently, nor jointly considering the whole network, will create the best results when the network spans multiple regions that differ strongly in demand -see Section 3.4 and Appendix C.2.6. This raises the question of which legs to consider jointly for outlier detection.
To find an answer, in Section 2.1, we adapt a method by Zahn (1971) to cluster legs such that (i) legs in the same cluster share demand and can be considered jointly for outlier detection, and (ii) legs in different clusters experience distinct demand and should be considered separately. Subsequently, in Section 2.2, we suggest a method for analysing bookings within one such cluster. Based on this, we propose a method to rank departures by the severity of identified outliers.

Clustering legs using correlation-based minimum spanning trees
To cluster legs based on correlations in observed bookings, we first consider the network as a graph where nodes represent the stations and edges represent the legs of a journey. Figure 1a illustrates this on a simple network. To illustrate the concept, we rely on an example from the transport domain: In this example, two train lines (red and blue) intersect at two stations (B and C). The red train arrives at stations B and C before the blue train, which creates two possible transfer connections for passengers: (i) switch from red to blue at B, (ii) switch from red to blue at C. Transfers from the blue to red train are not feasible. Standard graph clustering algorithms, as exemplified in Schaeffer (2007), seek to cluster the nodes of the graph. In contrast, we wish to cluster similar edges, which correspond to legs in the railway example ( Figure 1a). Hence, we invert the graph to make existing clustering algorithms applicable. In this inversion ( Figure   1b), the directed edges become nodes, e.g. the edge from A to B becomes node AB.
The inverted graph features an undirected edge between two nodes when: • both legs are in the same train line and share a common station, e.g. legs CD and DE are connected through station D, or • the legs are in different train lines but share a common transfer station where a connection is possible, e.g. leg FB (red line) and BC (blue line) are connected through station B. However, AB (blue line) and BC (red line) would not be connected by an edge as no connection can be made between them (as we have assumed the red train arrives at B and C before the blue train).
In theory, this transformation could also create edges between legs that share a common entry or exit node, e.g. FB (red line) and AB (blue line), or CG (red line) and CD (blue line). Given that such pairs of legs would never occur in the same itinerary, we would not expect demand outliers to affect both legs. Therefore, inserting an edge between them, potentially allowing them to be in the same cluster, is counter-intuitive. In addition, exploratory analyses of the empirical data found that correlations between these types of legs were very low across the entire network section.
The algorithm aims to assign those legs that experience similar bookings to the same cluster and those that experience dissimilar bookings to distinct clusters. A corresponding metric only needs to consider the similarity between adjacent legs that share a connecting station since edges do not otherwise exist in the inverted graph. We propose to quantify this similarity via the correlation between booking patterns.
To calculate correlations between booking patterns, we compute the functional dynamical correlation (Dubin and Müller, 2005). Functional dynamical correlation is based on calculating scalar products between pairs of smoothed booking patterns; the appendix provides further details. We use the average of these paired correlations over time as the similarity measure between two legs. Unlike more common statistical correlation measures, such as Pearson correlation, functional dynamical correlation does not assume a specific type of relationship between variables (e.g. linearity). It also accounts for the time dependency between observations within the booking horizon when the intervals between observations vary. For example, in the empirical RM data analysed in Section 5, the time between observations decreases as the departure date approaches. Further, alternative measures for calculating correlations from functional data (such as functional canonical correlation) often make restrictive assumptions, which real data does not fulfil (He et al., 2003).
We benchmark the clustering algorithm under alternative correlation measures in Appendix C.1.
To represent the relationship between legs in the network, i.e. the nodes in the inverted graph, we attach weights to the edges in the inverted graph. These weights are interpreted as distances: a higher edge weight indicates that the connected nodes are more dissimilar. Therefore, an applicable weight function should be non-negative. Further, the weight function needs to ensure that any negatively correlated legs are marked as more dissimilar. Even though a negative correlation may imply that outlier demand jointly affects both legs, we expect it to affect negatively correlated legs differently. Therefore, these require different adjustments from an analyst and should be in different clusters. To satisfy these requirements, we define the edge weights as: where ρ(ij, jk) is the correlation between bookings on legs ij and jk. Though the use of functional dynamical correlation as a measure of similarity between time series is not new, its application as an edge weight in a network setting, to our knowledge, is novel.
To allow for irregular cluster shapes, we recommend a minimum spanning tree (MST) algorithm (Prim, 1957). For example, in Figure 1b, a cluster may include AB and DE because they are in the same line, rather than clustering AB and FB.
Minimum spanning tree approaches work well for clusters with irregular boundaries (Zahn, 1971). Alternative clustering approaches (such as k-means) often assume a specific shape of clusters (spherical, for k-means). MST-based clustering approaches also do not assume that clusters are of similar sizes (Peter and Victor, 2010). This makes them particularly suitable for transportation networks constructed as a series of interlocking lines, where the points of intersection are often not equally spaced.
For example, MST-based approaches have previously been used in optimising layouts of railway networks (Liang et al., 2020).
A spanning tree of a graph is a subgraph that includes all vertices in the original graph and a minimum number of edges, such that the spanning tree is connected.
Then, the MST is the spanning tree with the minimum summed edge weights -see Figure 1c. Since the inverted graph is weighted, we use Prim's algorithm (Prim, 1957) to calculate the MST -Appendix A.2 provides a detailed introduction. Any one-to-one transformation of the weight function, w (ij,jk) will produce an identical minimum spanning tree.
There are two approaches to obtaining clusters from an MST: (i) pre-defining the number of clusters as k and removing the k − 1 edges with the highest weight; or (ii) setting a threshold for the edge weights and removing all edges with weights above some threshold, creating an emergent number of clusters. Here, we implement the threshold-based approach, ensuring that each cluster has the same minimum level of correlation. In contrast, setting the number of clusters in advance could result in very heterogeneous levels of correlation across clusters. Further, setting k too low may result in legs with dissimilar features being grouped together. We apply a threshold correlation of 0.5 -the level at which legs are more correlated than they are not. This corresponds to a transformed edge weight of 0.5. In the example given in Figure 1c, this means removing all legs with a weight above 0.5, resulting in the three clusters shown in Figure 1d. The choice of this clustering threshold will impact the number of alert lists produced. Therefore, we recommend considering factors such as staffing resources and any current (informal) network clustering when choosing this threshold.
While the outlier detection procedure described next applies to individual clusters, it does not require a particular clustering approach. Hence, other implementations may employ alternative approaches, as reviewed in Schaeffer (2007). In particular, depending on the business context of the network service, alternative clustering algorithms may be more appropriate. The network topology should drive the choice of which clustering algorithm is most appropriate: That topology may differ, e.g., when considering airlines versus bike rentals versus railways, as discussed in Rennie et al. (2022). The choice of an MST-based approach, which often returns linear clusters, is appropriate for the railway application motivating the paper at hand, given the linear nature of the underlying network structure. We further evaluate the performance of MST clustering in this application in Appendix C.1.
Furthermore, edge-based clustering could replace the graph inversion and nodebased clustering presented here. However, literature on edge-based clustering is far more limited, and such approaches tend to improve the visualisation of networks with a very high number of edges by reducing the number of edge crossings rather than grouping together the most similar edges (Qu et al., 2007). In contrast, inversion and node-based clustering aim to group network legs that exhibit the highest degree of similarity. However, alongside these advantages, there may be some drawbacks. The node-based approach requires deciding on criteria to select edges to include in the inverted graph.

Detecting outliers in clusters of legs
Given established clusters, we propose identifying demand outliers within each cluster and quantifying their severity to provide a ranked alert list of departures. The previously described clustering allows for processing the outlier detection in parallel for separate clusters, enabling efficient computing.
To identify which departures to include in the alert list, we consider the functional depth of the booking patterns, as in Rennie et al. (2021). This step could also rely on other measures of exceedance, including univariate "threshold" approaches, which look at aggregated bookings and ignore the distribution of bookings over time. We propose to rely on functional depth, as previous work has found this to be the most effective as an outlier detection mechanism (Rennie et al., 2021).
To compute the functional depth, consider N departures observed over L legs.
Let y nl = (y nl (t 1 ), . . . , y nl (t T )) be the booking pattern for the n th departure on leg l, observed over T booking intervals t 1 , . . . , t T . Let Y l be the set of N booking patterns for leg l. For each leg and departure, calculate the functional depth (d nl ) given the related booking patterns following the approach given in Hubert et al.
(2012) and detailed in Appendix A.3. The functional depths take on positive values, with smaller values of the depths relating to more outlying booking patterns.
For each leg l, we calculate a threshold for the functional depth using the approach of Febrero et al. (2008). This method (i) resamples the booking patterns with probability proportional to their functional depths (such that any outlying patterns are less likely to be resampled), (ii) smooths the resampled patterns, and (iii) sets the threshold C l as the median of the 1 st percentiles of the functional depths of the resampled patterns. Here, we use the 1 st percentile of the depths as the default threshold, as this has been found to work well in practice (Febrero et al., 2008;Rennie et al., 2021). Booking patterns with a functional depth below the threshold C l are classed as outliers. We explore alternative threshold choices in Appendix C.2.2.
To create ranked alert lists, we first define z nl to be the normalised difference between the functional depth and the threshold: This transforms the depth measure d nl into a measure of threshold exceedance.
Values of z nl greater than zero relate to booking patterns classified as outliers.
Normalising by the threshold, C l , ensures the values of z nl are comparable between different legs.
Next, we define the sums of threshold exceedances across legs: We sum only those values of z nl that are greater than zero to avoid outliers being  To create a ranked list of outlier departures, i.e. those with a non-zero-sum of threshold exceedances, we assign a severity θ n . A higher value of θ n indicates the departure is more likely to be affected by extreme outlier demand and hence should be targeted first by RM analysts.
To model threshold exceedances, we turn to extreme value theory (EVT) -a branch of statistics that deals with modelling rare events occurring in the tails of a distribution. Given that outliers are unusual events, which occur in the tails of distributions, EVT is a clear direction to turn to for modelling outliers -see Talagala et al. (2019). There are two common approaches to EVT: (i) block maxima, which examines the maximum value in evenly-spaced blocks of time, e.g. annual maxima, and (ii) peaks over the threshold, which examines all observations that exceed some threshold (Leadbetter, 1991). The generalised Pareto distribution (GPD) is commonly used to model the tails of distributions in the peaks over threshold approach (Pickands, 1975). Motivated thus, we fit a generalised Pareto distribution (GPD) to the sum of threshold exceedances given in equation (3). The GPD has three parameters with probability density function: Here, µ specifies the location, σ the scale, and ξ the shape of the distribution.
We fit the parameters using maximum likelihood estimation (Grimshaw, 1993) via the R package POT (Ribatet and Dutang, 2019). A kernel density estimate of the empirical distribution of z n > 0 from Figure 2 is shown in Figure 3a. The resulting fitted GPD is shown in Figure 3b. As the further analysis in Appendix D.5 shows, the GPD fit appears reasonable compared to the empirical distribution.
Two common issues arise in fitting GPDs: (i) the choice of threshold and (ii) the independence of the data points. When the threshold is too low, the assumption of a GPD no longer holds; when it is too high, there are too few data points to fit.
We select a threshold of 0, i.e. we fit the GPD to values of z n > 0. Rather than change the threshold at the GPD level, we control the number of observations the GPD is fitted to by varying the percentile used for the individual leg thresholds, C l . We choose C l as suggested by Febrero et al. (2008) and find that this choice works well and provides sufficient outlying points to fit a GPD in both simulated and empirical data.
To account for the second issue, applications of extreme value theory frequently first decluster the peaks over the threshold to ensure independence between observations (Fawcett and Walshaw, 2007). To that end, the analysis may only consider it is theoretically possible that observed outliers may be dependent; e.g. increased demand caused by Easter affects not only Easter Sunday but also the surrounding days. However, similar outliers may also result from independent events. As we aim to identify outlying departures rather than the underlying events, this argument causes us not to decluster here.
We define θ n as the non-exceedance probability given by the CDF of the GPD: Formally, θ n is the probability that, given an outlier occurs, the sum of threshold exceedances is at least as large at z n . Thus, it is not the probability that a departure is an outlier. However, we use this non-exceedance probability as a measure of outlier severity on a scale of 0 to 1.
Departures with functional depths that do not fall below the threshold on any legs carry a severity of zero, i.e. they are classified as regular departures. It is conceivable to estimate the uncertainty of θ n (Smith, 1985) to determine further levels of criticality, e.g., if there are several departures with the same outlier severity, the one with the smallest uncertainty would be ranked first. However, given the continuous nature of the data, it is unlikely that multiple departures carry an identical severity. Hence, we leave uncertainty estimation to future research.
From the severity defined in equation (6), we construct a ranked alert list con-taining all departures with a non-zero outlier severity. Although functional depth could be directly used to construct the ranked alert list, computing the severity provides a measure of the difference between ranks and is more easily interpreted by analysts. The top 8 ranked outliers relating to Figure 2, are shown in Table 1. exceed the threshold only to a small degree have lower but strictly non-zero severity.

Ranking
These outliers are most likely false positives and potentially waste analysts' time.
Hence, we suggest limiting the length of the list in practice.
To limit the length of the alert list, we might (i) only include departures if their severity is above some threshold or (ii) set a maximum length. Since we wish to control the number of alerts an analyst will receive, we analyse outlier detection performance as dependent on the maximum length of the alert list. Recall that we classify departures as outliers if and only if their outlier severity exceeds zero. Therefore, if the required length of the alert list exceeds the number of identified outliers, we do not include further departures. Appendix C.2.7 features further results on the outlier detection performance when varying the outlier severity threshold.

Outlier Detection Performance
We first implement a simulation study to evaluate the outlier detection performance given known outliers. By varying the demand for itineraries in one cluster, we create outliers that are observable on both the leg and network levels.
The simulation models a network consisting of 5 stations and 4 legs, as shown in  at time t, is given by: Here, φ io is the fraction of customers of type i and D o ∼ Gamma(α o , β o ) with probability density function: where a io and b io define are the parameters of a Beta distribution which defines how customers arrive over time. We generate demand over a horizon of 3,600 time slices to ensure λ i,o (t) < 1. This level of detail is required to accurately parameterise the dynamic program for bid price control. The resulting bookings are aggregated into 18 booking intervals.
As in Rennie et al. (2021), we consider differentiated demand from two customer types represented by the set I = {1, 2}. We assume that customers book the cheapest available fare class and differ in price sensitivity. We define p ijo as the probability that a customer of type i pays up to fare class j on itinerary o. By combining demand from two customer types that differ in price sensitivity with offers that depend on the current set of offered classes, we mimic a realistic price effect: Offer prices result from the cheapest class currently offered by the firm, as customers will buy the cheapest available class. When a customer's willingness to pay does not equal or exceed the price of the cheapest available class, they do not buy; hence, their price sensitivity translates to decreased demand. Note that the price-demand response depends on the itinerary and time in the booking horizon.
Combining this demand model with the given network creates 210 demand parameters. We validate that the functional dynamical correlation between the four legs for simulated data is comparable to empirical railway data as detailed in Appendix D.7. We generate all regular demand based on these parameters.
The simulation excludes trend and seasonality to evaluate outlier detection approaches in a best-case scenario. In other words, if an algorithm fails on observations from stationary demand, it will likely not perform better given more demand variability. However, additional results based on simulation data that does feature seasonality can be found in Appendix C.3.

Outlier generation and evaluation
We generate demand volume outliers by changing the Gamma distribution parameters that govern the total demand level according to equations (7) and (8). Previous work found the proportion of outliers had little effect on outlier detection performance in the single-leg case (Rennie et al., 2021). Therefore, we generate booking patterns for 500 departures per demand setting, with 1% of departures experiencing outlier demand. That is, we generate 495 departures from the regular demand distribution and 5 outliers from a set of twelve outlier distributions where the mean has shifted by ±10%, ±20%, ±30%, ±40%, ±50%, and ±60%. For every shift in the mean, we reduce the variance of the outlier demand distribution by 80%. This still results in an overall increase in the variance of total demand in the presence of outliers but also ensures that we sample sufficiently outlying demand values.
Outliers may also occur due to factors such as changes in arrival times or changes in customers' willingness to pay. Rennie et al. (2021) provide results on how the performance of functional depth varies under these different types of outliers. Here, we focus on the different types of outliers caused by varying network effects. In all cases, we consider the application of the outlier detection procedure to the constrained demand -applying the approach directly to the booking patterns without applying any unconstraining approaches first. The problem of unconstraining is one of the major challenges of demand forecasting for revenue management and beyond the scope of this paper.
We differentiate outlier scenarios in terms of the affected network components.
Firstly, we evaluate a scenario where outlier demand affects all network itineraries.
We consider the case where each outlier is randomly drawn from one of the twelve outlier distributions, resulting in outliers from a mixture of different distributions.
This lets us test whether the ranking of the alert list mirrors the outliers' underlying degree of demand deviation. Then, we consider each of the twelve outlier distributions in isolation to assess the detection sensitivity. Secondly, we evaluate a scenario where outliers only affect a single itinerary. This evaluates the benefits of clustering multiple legs.
In Appendix C.2.6, we consider the practically relevant case of outliers affecting a subset of itineraries and provide further details on all simulation experiments.
Each combination of outcomes can be classified into one of four categories: (i) assigning a non-zero outlier severity to a genuine outlier creates a true positive (TP); (ii) assigning a zero outlier severity to a regular observation creates a true negative (TN); (iii) assigning a non-zero outlier severity to a regular observation creates a false positive (FP); (iv) assigning a zero outlier severity to a genuine outlier creates a false negative (FN). This classification enables us to compute the true positive rate (TPR) for the top R ranked departures in the alert list: where T P R is the number of true positives in the top R departures. The true positive rate lies between 0 and 1, where 1 means all genuine outliers were identified. We evaluate performance across 1,000 stochastic simulations.
In an ideal setting, the alert list should feature, from top to bottom, large outliers and, subsequently, smaller outliers. Therefore, we also use the distribution of outliers within the ranked alert list to evaluate how well the method ranks the most critical outliers.

Comparison with PCA+HDR
This benchmark (i) computes features (e.g. mean, variance, curvature) of the booking patterns for the total demand in a cluster; (ii) uses PCA (Yang and Shahabi, 2004) to identify the first two principle components from the features; and (iii) uses HDR, a density-based approach (Hyndman, 1996), to find the ν points with the lowest density in the first two principal components. These points are classified as outliers. Extended details of the method, including the list of features, can be found in Appendix B.2. This method provides an ordering of the outliers but not a severity measure, as illustrated by Figure 5.

Comparison with non-ranked, single-leg approaches
To highlight to critical features of FD+Agg, we benchmark (i) the use of severity measures to rank outliers and (ii) the inclusion of network effects. To isolate the effects of each of these features, we perform two separate benchmark tests: We evaluate the effect of ranking outliers by measuring the increase in precision when ranking outliers. For example, we consider the precision in the top 5 ranked departures versus 5 randomly chosen departures with non-zero outlier probabilities (i.e., as in Rennie et al. (2021)). The change in precision when considering the top R departures, ∆(P recision) R , is given by: where T P R(random) is the number of true positives in a random selection of R departures with non-zero severity, and F P R(random) is defined analogously for false positives.
We quantify the value of accounting for network effects by computing ranked alert lists for each leg in isolation. We then compare the true positive rates to the aggregated, network-driven approach presented in this paper.

Detecting outliers in multiple legs
As a first experiment, we consider the scenario where outlier demand equally affects all itineraries and legs within the cluster. For this scenario, Figure

PCA+HDR benchmark results
The PCA+HDR approach requires a given number of outliers to detect, ν, as input.
Therefore, we compare the performance of the benchmark method under different choices of ν to FD+Agg.  Under FD+Agg, the modes of the distributions generally fall where they should, as larger outliers are ranked higher. The smaller variance in the ranking of the larger magnitude outliers indicates that they are easier to detect. The higher variance of the medium-sized outliers can be explained as the ranking of a medium-sized outlier is dependent on which other types of outliers occur: if there is a large and a medium outlier, the medium outlier is ranked lower; if there is a small and a medium outlier, the medium outlier is ranked higher. The distribution of outliers detected by PCA+HDR, shown in Figure 5d, also has the modes in the correct order. However, there is much more overlap between the distributions, showing its inability to correctly rank the outliers. The increase in precision from applying our method compared to PCA+HDR is similar to the increase in precision from the inclusion of the ranking (see Figure   6b). This suggests that PCA+HDR performs reasonably well in terms of outlier detection, but poorly in terms of ranking the outliers. Figure 7 shows the true positive rate when a ranked alert list is computed for each leg in isolation versus in the proposed aggregated manner. Here, we consider outlier demand generated by a 50% increase in the affected legs as an illustrative example. We analyse detection performance by breaking down results in terms of which itinerary the outlier demand is generated in. We show only the results relating to itineraries AB, AC, AD, and AE. Figure 26 in Appendix C.2.5 details results for the further itineraries yielding similar conclusions.

Comparison with single-leg approach
For results when outlier demand is generated across combinations of itineraries, refer to Appendix C.2.6.
In all cases, the true positive rate for clusters is higher than in any of the individual legs. This is because when considering the leg's bookings in isolation under outlier demand that affects multiple legs, the noise from other itineraries prevents detecting the outlier in every leg. However, clustering increases the number of detected genuine outliers. Aggregation is most beneficial when the outlier demand affects the most legs. In our example, this applies when itinerary AE experiences outlier demand, as shown in Figure 7a. The lower true positive rates in legs AB and DE result because different combinations of itineraries also utilise these legs. The aggregation is less beneficial when outlier demand affects an itinerary consisting of only one or two legs since we aggregate the analysis across legs that are actually not affected by outlier demand. However, there is a modest gain in true positive rate even in this case -compare Figure 7(c). This is due to the knock-on effects of decreased capacity on the affected legs, impacting the bid prices for any itineraries which include these legs. For some lengths of the alert list, the leg-level true positive rates are higher than the aggregated approach, due to false positives from unaffected legs being included in the list. However, even for itinerary AB (Figure 7d), where false positives from unaffected legs are most likely, the difference is small and cancelled out by the overall increase in true positive rate.

Sensitivity to different magnitudes of outliers
To better understand outlier detection performance, we break down the results by the magnitude of outliers in Figure 8. When outliers result from minor changes Given the significant overlap between the distribution of outlier demand with a 10% change in magnitude and that of regular demand, this is to be expected. Therefore, 10% demand changes effectively provide a lower bound on how big an outlier needs to be in order to be detected.
As the magnitude of the outliers increases, they become easier to detect and true positive rates are higher, with peak rates reached with shorter alert lists. Thus, genuine outliers are more likely to be ranked higher when they are caused by larger demand changes. For demand decreases of at least 50%, the true positive rate is very close to the optimal detection rate. Negative demand outliers are slightly easier to detect than positive demand outliers, meaning shorter alert lists are required.
This is due to the demand censoring imposed by the booking controls and capacity restrictions.

Simulation Study: Forecast Adjustments
To evaluate the implications of adjusting the demand forecast for further planning steps, we simulate network demand and the optimisation of offered fare classes over the booking horizon. We list and explain all parameters determining the settings in the simulation study in Appendix B.3. In this section, we first detail how the simulated RM system uses the demand forecast to compute revenue-optimal offers based on bid prices. In that, it follows a widely implemented industry standard.
Subsequently, we describe alternative strategies that analysts may apply to adjust demand forecasts based on identified outliers. Finally, by comparing revenue gained from offers based on different adjusted demand forecasts under the same simulated outlier demand, we highlight the effects of adjustments as dependent on outlier scenarios.

Network revenue management system
The simulated RM system controls the offered set of fare classes per itinerary to optimise expected revenue. To that end, it implements a dynamic program to compute bid prices per leg and sums them up per itinerary following the methodology described in Strauss et al. (2018) and detailed in the appendix. To test for the sensitivity of results with regard to the revenue optimisation, we compared two industry standards, the leg-based EMSR heuristic as introduced in Belobaba (1987) and dynamic programming in initial simulations studies not further documented here.
The results showed that, for the given demand model, the choice of optimisation approach had little effect on the quality of the outlier detection.
The bid price indicates the marginal difference between the value of selling a seat in the current time period and that of reserving it to sell in a future time period. Booking patterns result as customers arrive and decide to book one of the offered fare classes. The firm does not report booking patterns for each individual itinerary, but only records them on the leg level.
The dynamic program relies on a given set of expected demand arrival rates per leg l, fare class j, and time slice t of the booking horizon. In the simulation, we derive expected demand arrival rates from our knowledge of the underlying demand model. Arrival rates for each leg l and fare class j are given aŝ where

Forecast adjustments for outlier demand
One aim of identifying outlier demand in booking patterns is to support analyst adjustments in RM systems. Without such adjustments, offers would be optimised for a regular demand forecast and thereby not be fit for maximising revenue under outlier demand. This raises the difficulty of predicting the consequences of analyst adjustments throughout the network. As a step in this direction, we analyse a bestcase scenario, assuming that the adjustment is made with foresight before the start of the booking horizon. We compare the revenue under three different adjustments: • Adjustment 1 (conservative): Adjust only forecasts of affected single-leg itineraries. E.g. for an outlier creating additional demand for itinerary AC, increase the forecasts of itineraries AB and BC.
• Adjustment 2 (aggressive): Adjust forecasts of all itineraries that include at least one of the affected legs. E.g. for additional demand for itinerary AC, adjust all itineraries, including either leg AB or leg BC -i.e., itineraries AB, AC, AD, AE, BC, BD, and BE.
• Adjustment 3 (balanced): Adjust forecasts of affected single-leg itineraries and the cluster-spanning itinerary -in this case, AE. E.g. for additional demand for itinerary AC, adjust itineraries AB, BC, and AE. The motivation for adjusting AE (ahead of other itineraries) is that, in general, this will be the most popular itinerary in the cluster.
These three adjustments are not the only choices available to analysts. However, they represent options that stretch across the spectrum of how fully network effects should be considered. Adjustments 1 (conservative, leg-based adjustments only) and 2 (aggressive, all potential network effects) are the two extremes. Adjustment 3 (balanced) is a compromise, which is more conservative than Adjustment 2 but still identifies the itinerary most likely to be the source of outlier demand. Further options would be to include more than just the cluster-spanning itinerary in an alternative to Adjustment 3, but this leaves another choice of which itineraries to prioritise. As a lower bound, we compute the revenue when no adjustment is made. As an upper bound, we implement an oracle adjustment, i.e., only adjusting the forecasts of affected itineraries. We compare the revenue as the level of outlier demand ranges from -60% to +60% of the average leg demand. Figure 10 shows the revenue generated by outlier demand for each of the three adjustments. We show the results for four of ten itineraries contained within these four legs in Figure 4. The results for the other six itineraries are similar. Appendix C.5 further details these results as well as results on adjustments after outlier detection.  When outlier demand affects all four legs in the cluster (Figure 10a), any type of adjustment is always better than no adjustment. Besides the oracle, the best choice is Adjustment 3, i.e., the balanced approach, which adjusts the forecasts of the cluster-spanning itinerary and the individual leg. Adjustment 3 is able to obtain, on average, 87% of the additional revenue gained under the oracle adjustment. Similar results are obtained when the outlier demand affects three legs (Figure 10b).

Experimental Results: Revenue Benefits
When outlier demand affects only a single-leg itinerary (Figure 10d), the conservative Adjustment 1 and the oracle adjustment coincide. The aggressive Adjustment 2 yields less revenue than no adjustment. For example, although leg AB is correctly adjusted, the erroneous adjustment to itineraries AC, AD, and AE results in incorrect forecasts for legs BC, CD, and DE. The asymmetry between adjustment to positive and negative outlier demand is due to the level of demand being bounded from below by 0. Similar results emerge when the outlier affects only two of the affected legs (Figure 10c), though the negative consequences of over-adjusting all potentially affected itineraries are less severe, as this causes fewer superfluous adjustments.
The negative impact of adjusting unaffected itineraries highlights the importance of correctly clustering legs ahead of outlier detection. The closer the outlier demand itinerary is to the cluster-spanning itinerary, the less risky it is to adjust all affected itineraries within a cluster, and the more benefit can be gained from doing so. From a managerial perspective, the best adjustment (other than the oracle) depends on the firm's objective. To maximise revenue when the most common outlier (e.g. itinerary AE) occurs, the balanced Adjustment 3 is preferable. Conversely, if the objective is to minimise risk to revenue even in the more unlikely scenarios (e.g. an outlier in itinerary AB), conservative Adjustment 1 is preferable. Overall, however, there are clear benefits from forecast adjustment.

Empirical Study
To demonstrate the practical applicability of the proposed clustering and outlier detection, we apply it to a set of empirical data obtained from Deutsche Bahn.
This data set features only bookings of the 2nd class compartment, such that all bookings on one leg require capacity from the same compartment. The Deutsche Bahn long-distance network consists of over 1,000 train stations, letting the provider offer more than 110,000 direct origin-destination combinations. The numbers grow further when accounting for alternative transfer itineraries and multiple daily departures. Figure 11 shows We first apply the correlation-based clustering approach of Section 2.1, using a threshold of 0.5, such that only legs with a minimum correlation of 0.5 can be in the same cluster. In Figure 12a, coloured bubbles indicate the four resulting clusters: Each train line splits into one large and one small cluster.
To evaluate clustering on empirical data, where the true underlying demand for each itinerary is unknown, we use the network topology to check whether the resulting clusters are plausible. To that end, we propose the following set of rules: can transfer between lines, we expect relatively few passengers to make the same connection. Further, it makes sense to consider train lines separately for forecasting and analyst interventions.
• Train lines are further split into separate clusters on either side of a major station. As many passengers leave the train at a major station and many different passengers board, we shall assume a relatively small proportion of passengers book itineraries that pass a major station. Similarly, given that itinerary demand share is driven by which journeys are most common, and passengers often either board or alight at a major station, it is intuitive to have a cluster that contains the legs between major stations.
Deutsche Bahn assigns an ordinal indicator of importance to each station, ranging from 1 to 7. We define a major station to be in Category 1. The entire Deutsche Bahn network includes 21 major stations, whereas the considered network section includes nine major stations. Figure 12b highlights major stations in grey and shows the clusters resulting from the above rules.
The correlation-based clustering returns four clusters, whereas the rule-based clustering returns nine. Nevertheless, the resulting clusters share similar features.
Firstly, the two distinct train lines end up in different clusters in either approach.
For legs in distinct train lines, correlation tends to be higher between legs that share a transfer station, but not to a convincing extent -the correlation is at most 0.22. A correlation threshold of 0.27 creates two clusters (one for each train line).
Secondly, the breakpoints for the correlation-based approach are a subset of the breakpoints, i.e. major stations, in the rule-based approach. We conclude that the correlation-based approach achieves similar results as the rule-based approach without requiring expert input. We can formally compare clustering results using the Normalised Mutual Information (NMI) (Amelio and Pizzuti, 2015). The NMI is 1 if two clusterings are identical, and 0 if they are completely different. Figure 13a shows the NMI between the correlation-and rule-based approaches while varying the threshold in the correlation-based approach from 0 to 1. This shows that both approaches achieve similar results, with an NMI reaching 0.899.
The approaches are generally more similar at higher correlation thresholds (around 0.7) since the rule-based approach generally creates more clusters. Figure 13b compares the number of clusters of the two approaches -as the correlation threshold changes, the number of clusters ranges from 1 (everything in a single cluster) to 28 (each leg in its own cluster), demonstrating the flexibility of the correlation-based approach.

Large network subsection
We extend the empirical study to five train lines to further demonstrate the complexity that considering the network structure brings to clustering and outlier detection,

Detecting outliers in the Deutsche Bahn data
Having established clusters, we apply outlier detection independently to each cluster. To exemplify this on empirical data, we apply the outlier detection procedure to a representative four-leg cluster from the Deutsche Bahn network. Applying the proposed outlier detection approach to empirical data cannot precisely judge detection accuracy, given there is no labelled data on genuine outliers. However, this analysis demonstrates the full process of outlier detection on empirical data including e.g. seasonality and underlines practical implications.  To pre-process the data for outlier detection, we transform the booking patterns by applying a functional regression model (Ramsay and Silverman, 1997). We then apply the outlier detection to the residual booking patterns. In this pre-processing, we correct for three factors: (i) the departure day of the week; (ii) the departure month of the year; and (iii) the length of the booking horizon. 1 The functional regression fits a mean function to the booking patterns for each different factor in the model. Table 9 in Appendix D.1 compares models including different factors. Let y nl (t) be the n th booking pattern for leg l. Then:   Table 11b. The clustering approach can consider either the correlations between the booking patterns or the residual booking patterns. Given that the functional depth (the basis for the outlier detection) is calculated on the residuals, we suggest using the correlation between residual patterns to define the clusters. For this data set, the same clusters resulted in either case.
We calculate the functional depth of each booking pattern and compute the threshold as described in Section 2.2. We then transform the depths as per equation (2) to obtain z nl , as shown in Figure 16. The sums of threshold exceedances, z n , were shown earlier in Figure 2, with the empirical distribution and fitted generalised Pareto distribution shown in Figures 3a and 3b, respectively.

Conclusion and outlook
In this paper, we proposed a two-step method for (i) clustering legs in a mobility network that could benefit from joint outlier detection, and (ii) detecting outlying demand within such clusters. Furthermore, the proposed method, FD+Agg, ranks Furthermore, we implemented a simulated revenue management system to measure the potential revenue benefits of identifying and adjusting for demand outliers in a network setting by applying forecast adjustments across a cluster of legs. This analysis showed that taking into account the similarity of the legs can improve revenue in most scenarios. In the less likely scenario where only one or two legs of a cluster are affected by outlier demand, risk-averse firms may prefer individual leg-level adjustments.
Finally, by applying the proposed approach to empirical booking data collected by Deutsche Bahn, we demonstrated its applicability and scalability to the type of data observed in practice. In particular, we used this analysis to showcase the expected cluster results and to demonstrate how to account for additional practical considerations, such as trend and seasonality. Note that once the clustering has been performed, the outlier detection can be performed in parallel within each cluster. Therefore our methodology is scalable to a much larger data set, such as the entire Deutsche Bahn long-distance train network. Such an analysis is not included in this paper as, beyond giving excessive insight into confidential company data, the research insight to be gained from visualising even more complex network cut-outs is limited. given itinerary, only considering such itineraries risks systematically ignoring outliers from smaller itineraries and feeder legs. Secondly, when offering many potential itineraries, providers rarely store all booking patterns per itinerary. For example, capacity-based RM, as described in Strauss et al. (2018), frequently considers leg booking patterns to ensure capacity availability on each leg of a requested itinerary. Accordingly, the methodology proposed here is compatible with capacity-based RM.
Finally, even in the idealised case of having large volumes of stored itinerary-level data for every possible itinerary in the network, then running outlier detection algorithms quickly becomes computationally infeasible as the number of possible itineraries grows rapidly with the size of the network. Detecting outlying clusters of legs, rather than individual itineraries, overcomes all three challenges, as we have demonstrated in this paper. We do however note that the outlier detection methodology we propose could be applied directly to itinerary data without performing clustering. However, we only recommend this for densely booked itineraries, as otherwise, zero-inflated data can induce inferior results. We explored this further in Appendix C.4.
Constrained versus unconstrained bookings: Observed bookings are constrained by any revenue management controls that were in place at the time of booking, whereas revenue optimisation models rely on unconstrained demand forecasts (Talluri and Van Ryzin, 2004, Chapter 9.4). To represent this practice, we analysed constrained bookings in this paper and analysed the effect of adjusting unconstrained forecasts in the computational study. In that vein, further research could also consider the impact of applying the analysis to constrained observations, as showcased here, versus applying it to unconstrained demand estimates, which are frequently used for demand forecasting.
Implications for decision support: Further research is needed to consider the practical aspects of outlier detection from the perspective of decision support. Outliers manifest as changes in arrival rate, price elasticity, or other variables that affect bookings. Outliers can be caused by stochasticity but also by changes in demand patterns as a result of external factors, such as specific events. Complemented by further analysis, successful outlier detection could have three potential uses for RM: 1) Detecting outliers early within the booking horizon through online analysis as proposed in Rennie et al. (2021), allowing for rapid interventions; 2) removing any detected outliers from training data for demand forecasting to improve results on predicting reference demand curves; and 3) if outliers can be attributed to specific events, the forecast model could be extended to include such events. Outlier detection can have broader benefits for operational planning in transportation networks, helping service providers to avoid overcrowding and delays. To realise such benefits, future research should particularly focus on effective ways to visualise outliers in networks and to communicate alert lists to planners. To further support analysts in their decision-making, additional measures could be included in the alert list.
These might include average fare in the affected cluster, potential revenue loss if the outlier is not accounted for, or the outlier severity resulting from running the outlier detection procedure on revenue (instead of booking) patterns. An interesting avenue of further research would be to incorporate a feedback element whereby analysts mark outlier alerts as useful or not useful. A supervised learning approach, e.g. one-class-classifiers, could then be combined with our proposed outlier detection routine to filter out false alerts. Analysts could additionally include feedback on the quality of the clustering approach.
Clustering methodology: Investigating the use of alternative clustering approaches is of interest -especially where the clusters are likely to be of different structures compared to the rail industry, e.g. in the airline industry where hub and spoke networks are more common than lines. Whilst this paper relied on clustering to improve outlier detection, we believe that the clustering approach is a useful contribution in and of itself. For example, clustering presents additional research avenues such as its application to improving network-level forecasting; supporting the planning for future new stations; evaluating how the transport network structure is changing over time or defining different travel zones. Finally, further research opportunities lie in considering how the success of network outlier detection depends on the network structure. This paper featured examples from transport, specifically railway networks. Other application areas of RM, such as hotels, where correlation is induced by bookings for multiple consecutive nights, feature sparser or structurally different service networks.

Acknowledgements
We gratefully acknowledge the support of the EP/L015692/1 STOR-i Centre for Doctoral Training funded by the Engineering and Physical Sciences Research Council. The authors thank Deutsche Bahn for the provision of data and are particularly grateful to Philipp Bartke and Valentin Wagner for helpful discussions and suggestions.

Data availability statement
The simulated data and source code that support the findings of this study are available from the corresponding author upon request.
R functions for outlier detection are available at: github.com/nrennie/outlierdetection-in-network-revenue-management. The code also includes some examples of the simulated data. Functions for simulating demand can be found at github.com/nrennie/simnetdemand.

Appendices Appendices A Additional details of method
Appendix A provides additional details on the proposed method described in Section 2, including the specifics of the correlation-based minimum spanning tree clustering, and the calculation of the functional depths.

A.1 Functional dynamical correlation
Let y n,ij (t) be the total observed bookings for the n th departure on leg ij up to booking interval t, and similarly for y n,jk (t). The functional dynamical correlation between the booking patterns y n,ij (t) and y n,jk (t) is: where y * n,ij (t), y * n,jk (t) = y * n,ij (t)y * n,jk (t)w(t)dt, and w(t) is a weight function that accounts for the time gap between observations.
Here, y * n,ij (t) is a standardised version of y n,ij (t): where µ ij (t) is a mean function, and: The functional dynamical correlation is then the average across all N departures:

A.2 Prim's algorithm
Prim's algorithm is a greedy algorithm with the following basic steps. Assuming the original graph G has V (G) vertices.
• Initialise the MST, T , with the edge with minimum weight and the two vertices it connects. Let V (T ) be the number of edges in T .
• While V (T ) < V (G): go through the remaining edges in G in order from smallest to largest weights, until one is found that is connected to T , but does not form a circuit (i.e. the edge does not form a loop such that T is no longer a tree).
-Add this edge (and the vertices it connects) to T .
More computationally efficient algorithms exist but given the reasonable size of the graphs considered, and more specifically their sparsity (very few stations are adjacent), the computational time is reasonable using Prim's algorithm.

A.4 Normalised Mutual Information
For a graph containing M legs, the mutual information between two clusterings A and B of the M nodes in the inverted graph is defined as: where M a is the number of nodes in the a th cluster of clustering A, and similarly for M b . The normalised mutual information (NMI) between two clusterings is defined as (Amelio and Pizzuti, 2015): where H(A) is the entropy (a measure of uncertainty) defined as: The Bellman equation for V t (x) is: where λ j (t) is the arrival rate of demand for fare class j in interval t: and ∆V t+1 (x) = V t+1 (x)−V t+1 (x−1) is the marginal cost of capacity in the next time period. The problem is solved with backwards recursion, with the following boundary conditions apply: These ensure (i) no revenue can be generated beyond the booking horizon i.e after departure; and (ii) that no further revenue can be generated if there is no capacity remaining. The bid price at time t with remaining capacity x is given by ∆V t (x).

B.2 Details of benchmark method
We use the method proposed by Hyndman et al. (2016) as a benchmark comparison for our proposed method in Section 3. The method works as follows: • Define the total demand booking patterns as the sum of the demand for each leg within the cluster.
• Compute f features of the n total demand booking patterns. • Apply principal component analysis (PCA) as per Yang and Shahabi (2004) to determine the first two principle components i.e. those that explain the most variance.
• Use a density-based multi-dimensional approach (Hyndman, 1996) to find points in the first two principal components with the lowest density.
• The nu points with the lowest densities relate to the departures which are classified as outliers.

B.3 Parameter values for simulation study
The parameter valued used to generate the demand in the computational study are outlined below.
Proportion of total demand from each customer type for each itinerary. It is assumed these are constant across itineraries.  Table 3 shows the different experiments that were carried out as part of the computational study. We consider cluster outliers in which every itinerary within the cluster is equally affected; itinerary outliers where only a single itinerary within the cluster is affected; and station outliers which affect all itineraries that end at a particular station.

Experiment Outlier Type Itineraries Affected
Magnitudes

C Computational results
Appendix C includes the extended results from the computational study described in Section 3. Results from additional simulation experiments to test the proposed clustering approach are also presented here.

C.1 Evaluation of network clustering
For the correlation-based clustering to perform well it needs to (i) accurately estimate similarity between adjacent legs, and (ii) use information about the pairwise similarity between adjacent legs to detect similarity between (potentially) more than two legs to form clusters. We use the proportion of total demand belonging to each itinerary to determine a clustering benchmark. For example, in Figure 18a, when all passengers travel the itinerary from A to E, the resulting bookings in each of the four legs would be identical. In this case, the correlation between legs would be 1 -giving a single cluster of four legs.
To evaluate the clustering when the underlying demand is known, we define the common traffic ratio between two adjacent legs as the proportion of total demand that relates to itineraries over both legs. That is, for two legs ij and jk, we define the common traffic ratio, r(ij, jk), to be: where D ij is the demand for itinerary ij, and D ik is the total demand for all itineraries which include both legs ij and jk. If all passengers book itineraries that traverse both legs, then r(ij, jk) = 1. Conversely, if no passengers book journeys that traverse both legs, then r(ij, jk) = 0. • Case 1: When itinerary AE accounts for at least 50% of the network demand, we expect legs AB, BC, CD, and DE to belong to the same cluster, as they experience mostly the same demand. The remaining demand is calibrated across itineraries such that the total demand for each leg is reasonably uniformly distributed. We compare the correlation-based clustering with the benchmark clustering of all four legs in a single cluster, when the average percentage of demand on each leg from itinerary AE is 50%, 60%, 70%, 80%, 90%, or 100%. Figure 19a shows the fraction of total demand on each leg, from each itinerary, in the case where 60% of demand is for itinerary AE.
• Case 2: We calibrate the majority of demand on leg AB and BC to be for itinerary AC, and the majority of demand on legs CD and DE to be demand for itinerary CE. For simplicity, the distribution of demand is symmetric across the four legs. We compare the performance when the average percentage of demand on each leg belonging to the clustering benchmark itinerary is 50%, 60%, 70%, 80%, 90%, or 100%. Figure 19b shows the case where 60% of demand on each leg is for the respective cluster itineraries (AC or CE).
• Case 3: We calibrate the majority of demand on leg AB for itinerary AB, the majority of demand on leg BC for itinerary BC, and so on. We compare the performance when the average percentage of demand on each leg belonging to the leg itinerary is 50%, 60%, 70%, 80%, 90%, or 100%. Figure 19c shows the case where 60% of demand on each leg is for the itinerary consisting of only that leg.
The results are shown in Table 4.  In almost all cases, the normalised mutual information between the correlationbased clustering and the benchmark equals 1, indicating congruence. We now extend the simulation study by comparing the output of the correlation-based clustering under different correlation measures. In additional to the functional dynamical correlation measure described in Section 2.1, we compare Pearson correlation (Pearson, 1895) and Kendall rank correlation (Kendall, 1938). Let y n,ij (t) be the observed bookings for the n th departure on leg ij, and y n,pq (t) analogous for leg pq.
• Pearson correlation: calculate the Pearson correlation between corresponding booking patterns, then average across all booking patterns. That is, for the n th of N booking patterns observed over T booking intervals, we calculate the Pearson correlation coefficient as: where y n,ij is the mean number of bookings for the n th booking pattern. Then: • Kendall rank correlation: observations (y n,ij (s), y n,pq (s)) and (y n,ij (t), y n,pq (t)) where s < t, are concordant if their ordering agrees, and discordant otherwise.
The Kendall rank correlation is defined between the n th booking patterns in legs ij and pq as: where t c is the number of concordant pairs, t d is the number of discordant pairs, and t 0 , t 1 , and t 2 are defined as follows: where u s is the number of tied values in the s th group of ties for in booking patterns for leg ij, and v t is analogous for leg pq. Then: We compare the cases where the correlation measure is (i) applied directly to the booking patterns, and (ii) applied to the booking patterns where the within-booking pattern relationships e.g. trends have been removed. The normalised mutual information between the clustering produced by the correlation-based clustering under each of the different correlation measures and the benchmark clustering is shown in  Functional dynamical correlation, however, continues to perform well with an NMI close to 1.
In order to determine why the Pearson and Kendall rank correlations initially appear to perform well in the single cluster case but fail in the two cluster case, we also compare the value of the correlation coefficient with the known demand share in a simple two-leg example. Consider the simple two-leg network shown in Figure   20.
Leg AB Leg BC Figure 20: Network with two legs The common traffic ratio of legs AB and BC is: If r(AB, BC) = 1, then the number of bookings on leg AB and leg BC are identical, and the correlation between them is 1. Conversely, if r(AB, BC) = 0, then the bookings on leg AB and leg BC are independent with correlation 0. Table 6 shows the estimates of the correlation, compared to the true ratio, r(AB, BC).   Table 7 provides the results shown in Figure 5 in tabular format.  We recognise that the percentage of departures that analysts are able to adjust strongly depends on the ratio of analysts to departures and that this is likely to be domain-dependent. Therefore, here we consider outlier detection performance as the functional depth threshold varies.
In This is quite a high percentage for them all to be considered outliers.

C.2.3 Distribution of outliers across multiple legs
In the scenario where all itineraries are equally affected, a high proportion of outliers should be detected in more than one leg. Figure 22a illustrates the proportion of outliers detected in 1, 2, 3 or 4 legs: More than half were detected in multiple   These results motivate aggregating threshold exceedances across legs in two ways: (i) since less than 100% of genuine outliers were detected in all legs, if outlier detection was carried out only on the leg level, outliers could be missed on some legs. (ii) Given that a much higher proportion of outliers detected in four legs were genuine outliers, by ranking booking patterns detected in all legs as more likely to be outliers, we focus analysts' attention on those more likely to be genuine outliers.

C.2.4 False Discovery Rate
The false discovery rate (FDR) is defined as the proportion of booking patterns classified as outliers which were false positives: See Section 3.4 for definitions of true and false positives. Figure 24 shows the FDR for the case where outlier demand affects all itineraries, and the magnitude is randomly chosen from each of the distributions described in Section 3.2. Figure 25 shows the FDR for each of the magnitudes of outliers considered in the simulation study. Given that smaller magnitude outliers are more similar to the regular demand, these result in higher false discovery rates.

C.2.6 Outliers affecting a subset of itineraries
We consider a case where demand outliers affect only a subset of itineraries. Practical examples for this phenomenon could include trade fairs or conventions as well as regional crises. In such situations, demand towards (or from) a specific destination is most affected. Here, clustering offers additional benefits in guiding analysts towards those itineraries where they should adjust the forecast or controls.
We differentiate four scenarios based on the four-leg-network described in Section For each of the four possible events considered, we investigate the case where this generates 50% increase in average leg demand. For simplicity, we assume these passengers are equally split between the itineraries which alight at the relevant station. Table 8 shows the resulting demand increases for each leg.

Event at Station
Itineraries Affected Additional 120 Passengers in Itineraries Resulting Demand Increase per Leg A-E, B-E, C-E, D-E +30 (12.5%) +60 (+25%) +90 (+37.5%) +120 (+50%) Table 8: Changes in leg demand resulting from an additional 120 passengers in itinerary demand Figure 27a shows the true positive rate for each of the cases. Although the event at E generates outliers in more legs, it is not the case that it has the highest true positive rate. This shows that though the approach aggregates across legs, it does not ignore outliers only in a subset of those legs, provided they are sufficiently large. These effects may also be caused by interactions between the booking limits on different legs. For example, in the case of an event at C, large increases in demand in legs AB and BC may cause booking limits to be reached earlier for these legs, which also limits bookings in itineraries such as AD and AE. Hence, an increase in demand for some legs may cause a decrease in bookings for different legs.
By jointly considering multiple legs for outlier detection, we are able to detect the knock-on effects of outliers even when the change in demand only affects a subset of legs. The change in precision can be interpreted similarly, in Figure 27b.
Had we considered outlier detection on a leg-by-leg basis, the outliers were more likely to be missed in some of the legs. By combining information across legs, we are better able to determine which itineraries are affecting the volume of demand.

C.2.7 Limit alert list length via outlier severity thresholds
The results in this paper focus on limiting the length of the ranked alert list simply by the number of alerts it contains as this is most relevant to analysts. However, an alternative approach limits the length of the list by the outlier severity assigned to each departure. For example, classifying a train as an outlier only if its outlier severity is above 80%.   Figure 29 shows the true positive rate as the outlier severity decreases from 100% to 0%, for each magnitude of outlier considered. Results are similar to those shown in Figure 8.

C.3 Detection results when demand is seasonal
In this section, we consider the case where demand is non-stationary across departures. In order to make the results of this simulation study more comparable with the other simulation studies included in the paper, we simulate 10 seasonal groups (somewhat analogous to months). We then take 50 "days" from each of the 10 groups to produce 500 booking patterns per simulation, as in the other simulation experiments. All other parameters remain fixed in the simulation.
We then fit a functional regression model (similar to the model described in Section 5.2 for the empirical study) to remove the seasonality before performing the outlier detection. The quality of the outlier detection will highly depend on the

C.4 Detection results using itinerary-level data
The methodology described in this paper focuses on the application to leg-level booking data -since in capacity-based RM only leg-level capacities are required to be stored. Often, the itinerary-level data is never recorded. However, if itinerary level data is available for analysis, one could apply the same approach (either with or without the aggregation step). The results of doing so are presented in this section. Figure 33 shows the true positive and false discovery rates from the alert lists generated for each itinerary. It shows that the outlier detection performs well in the most popular itinerary (the itinerary that defines the cluster -A→E). The results are comparable to running the outlier detection procedure on the leg-level data. In the remaining nine itineraries with far fewer bookings, the outlier detection performs poorly, as there are too few bookings to detect a pattern.  Figure 34 shows that when the itinerary-level outlier detection results are aggregated across itineraries, the overall performance is poorer than the leg-level aggregation. These results suggest that for itinerary-level outlier detection, the better approach would be to identify important itineraries and run the outlier detection routines on those without aggregation. This, however, means that outliers that occur only in a small part of the network (which can cause knock-on effects) would be systematically overlooked. To consider the full network, the leg-based approach that accounts for itineraries through clustering performs better.
To further illustrate that the issue with itinerary-level analysis is primarily caused by an insufficient number of bookings for less-popular itineraries (making The analysis in Section 4.2 constitutes a best-case scenario in which we assume that, if outlier demand affects a particular leg, the outlier is detected in that leg.
However, as we show in Section 3.4, even when demand outliers affect multiple legs, the outlier is not always detected in every leg due to noise. Therefore, we additionally compare different adjustments based on the output of the outlier detection, for an outlier in itinerary AE.
• Adjustment A: Adjust only the forecasts of the affected single-leg itineraries for those legs in which the outlier is detected.
• Adjustment B: Adjust the forecasts of the affected single-leg itineraries for those legs in which the outlier is detected, and the cluster spanning itinerary (AE).
We compare these both to making no adjustment and to the oracle adjustment.
This is still a best-case scenario to some extent, given that we assume the correct magnitude of adjustment is made. Figure 37 shows the revenue under adjustments A and B (as described in Section 4.2) depending on the output of the outlier detection procedure. Combining adjustments on the leg level with those on the cluster level provides superior results in contrast to leg level adjustments alone. Though making adjustments to only the single-leg itineraries may be risk-averse in the rare cases where an outlier affects only a small subset of the legs within a cluster, it may be detrimental to revenue when outliers affect multiple legs.   Figure 38 shows the residual booking patterns resulting from the functional regression applied in equation (12) of Section 5.2. Compare with Figure 15 of Section 5.2 -the obvious outliers are preserved.  Table 10 shows an example of the coefficients for each explanatory variable over the 19 booking intervals, along with their standard errors. Not all days of the week, or months of the year, are significant throughout the entire booking horizon. Similarly, some days of the week have much stronger significance than others e.g. Saturdays have a much larger impact than Tuesdays. Given that, if both µ = 0 and ξ = 0, the GPD reduces to an exponential distribution, it is appropriate to compare the fit of the GPD with an exponential distribution to check if the inclusion of additional parameters is beneficial. Figure 40 shows the P-P plots, i.e. the fitted theoretical CDF against the empirical CDF for the GPD ( Figure 40a) and the Exponential distribution (Figure 40b). The GPD provides a closer fit to the empirical data and the additional parameters better account for the shape of the distribution. The GPD does not provide a perfect fit, with the  However, given that we assume points with very low probability are more likely to be false positives, under-estimating may actually be beneficial. Further, only the highly-ranked outliers i.e. those with high probability, are likely to be considered by an analyst due to time constraints. The GPD provides a very good fit for those data points. If there is a sufficiently large number of threshold exceedances, an empirical distribution could alternatively be used to compute the probabilities.

D.6 Distribution of outliers across multiple legs
The proportion of outliers found in each number of legs is shown in Figure 41, with over half of the outliers detected in multiple legs. Compared with Figure 22, this shows a similar proportion of outliers as found in the simulation study. We also compare the correlations between the different legs for both the empirical and simulated data. Table 11 shows the functional dynamical correlation between the empirical booking patterns, and empirical residual booking patterns, for each leg. Table 12 shows the corresponding correlations between the simulated booking patterns. The values are similar and the rate of decay between legs as they get further apart follows a similar pattern.