1 Introduction and state-of-the-art

Revenue management (RM) is a challenging task for network service providers. The concept entails controlling the set of product offers over a fixed sales horizon such that, given the predicted demand for the offers, the expected revenue from selling a limited capacity is maximal. To maximise revenue in this way, the firm has to forecast the expected demand for all products that require capacity on shared resources.

Examples include transport itineraries that cross several network legs, and hospitality offers that combine room availability for multiple nights.

Several existing contributions, e.g. Weatherford and Belobaba (2002) and Rennie et al. (2021), demonstrate the negative effects of inaccurate demand forecasts on revenue performance but neglect network effects. Motivated by this, we propose a new approach to detect outliers in network bookings, thereby supporting forecast corrections for improved network revenue management.

1.1 Terminology

For simplicity, we employ a transport-based terminology throughout this paper: a leg describes a direct non-stop connection between two stations in a network, and an itinerary is any combination of legs that can be jointly booked as one product. A departure describes a journey along a connected series of legs that leave the origin station at a unique time and date.

We denote the accumulation of bookings across the sales horizon as a booking pattern. We define an outlier as a booking pattern resulting from short-term systematic demand changes for one or several related network itineraries. These outliers occur when demand deviates from the baseline due to unforeseen events. For example, demand increases for a specific destination affect the entire itinerary. In consequence, deviations in the booking patterns are observable on the legs arriving at that destination and in the feeder legs.

Capacity-based revenue management differentiates offers through fare classes. Fare classes describe combinations of fares and tariffs at which the firm offers a product. Customers booking a ticket for a specific departure may choose from several offered fare classes. For instance, the cheapest offer could be fare class ‘M’, costing 20 Euros and entailing a no-refund tariff.

1.2 Existing work

RM is a well-studied problem for many different products and services (Talluri and Van Ryzin 2004). Still, the specific issues around network services and demand forecasting have only recently come into focus. For example, Klein et al. (2020) review how single-leg RM practices generalise to the network setting. Weatherford (2016) surveys RM forecasting methods and focuses on airline itinerary-level forecasting.

So far, few authors have examined demand outliers in RM data. For historical hotel booking data, Weatherford and Kimes (2003) discuss a simple method of removing observations that are more than \(\pm 3\sigma\) away from the mean. Rennie et al. (2021) apply functional analysis to detect outliers on individual legs. Neither, however, considers outliers affecting multiple legs of a network. In Azadeh et al. (2013), the authors identify outliers in network railway bookings via a simple rule to remove them before forecasting future demand. For a slightly different perspective, Kumar and Khani (2020) analyse transit demand for outliers to detect special events. Notably, existing research on outlier detection frequently focuses on binary outlier detection without regard for quantifying how critical an outlier is.

Practical network RM relies on manual forecast adjustments (Quante et al. 2009; Schütze et al. 2020). Previous research has shown that the resulting judgemental forecasts can be biased and even superfluous (Lawrence et al. 2006; De Baets and Harvey 2020). Perera et al. (2019) note that forecasting support tools can improve user judgement by reducing complexity for the analyst. Analysts’ time is limited, so they cannot investigate every departure flagged as an outlier. For example, Deutsche Bahn experts estimate that they can reasonably adjust less than 1% of forecasts. Therefore, ranking outliers by criticality is crucial.

Beyond RM, Barrow and Kourentzes (2018) propose a functional approach for outlier detection in call arrival forecasting without regard for network effects. General outlier detection in networks often focuses on identifying outlying parts of the network. Fawzy et al. (2013) use this approach in wireless sensor networks to find faulty nodes. Ranshous et al. (2015) consider the extension to identify outlying nodes when the network changes over time. Most research on dynamic networks concentrates on analysing a single time series connected to each node rather than a set of time series, as required when booking patterns are reported for multiple departures. Hyndman et al. (2016) note that the problem of identifying unusual time series within a collection is not as extensively studied as other outlier detection problems. In this paper, we benchmark the approach suggested by Hyndman et al. (2016), which employs principal component analysis (PCA), against our newly proposed approach.

1.3 Contribution

We shall study booking patterns that result when customers book not just a single resource (leg) but network products that require multiple resources (itineraries). Such booking patterns may be reported on the leg or the itinerary level; in this paper, we assume that they are reported per leg and departure. This applies in the case of Deutsche Bahn, which serves as a motivation and empirical demonstration for the work presented here.

Network effects challenge outlier detection in two ways: On the one hand, demand outliers on the itinerary level affect bookings on all legs included in the itinerary. On the other hand, such outliers may not be recognisable when only considering leg bookings independently, given the noise from other itineraries overlapping those legs. As a result, directly extracting outliers from booking data collected in an entire, realistically sized network is likely an intractable problem. To circumvent this problem, in this paper, we aggregate and analyse booking patterns from legs instead of itineraries, as this allows for computationally and statistically tractable network-wide outlier detection. In Sect. 6, we further discuss the choice of leg-level- vs. itinerary-level-based analysis and point out how our procedures could be adapted to itinerary-level outlier detection.

Our network outlier detection procedure: (i) clusters legs with similar booking patterns and (ii) detects joint outliers within each cluster to compile ranked alert lists of outlying departures and affected legs. Our methodology significantly improves outlier detection performance in a network setting versus alternative methods.

In more detail, our proposed approach first clusters legs by measuring the similarity of booking patterns via functional dynamical correlation (Dubin and Müller 2005). We suggest this measure for its freedom from restrictive assumptions. As the proposed approach is modular, other correlation measures could be used for the same end. In the second step, the proposed approach detects outliers from booking patterns within each cluster by combining the functional data analysis methods of Febrero et al. (2008), Hubert et al. (2012), and Rennie et al. (2021) with a novel within-cluster aggregation, which generates a ranked alert list of outliers using extreme value theory. This alert list can help analysts to identify the need for further analysis and adjustments. We consider an outlier as more critical if it indicates a larger demand shift and if it is identified across multiple legs. Factors such as the average fare on legs where outliers are detected, or the revenue at risk from faulty forecasts, could also be incorporated into the definition of an outlier’s criticality.

Finally, analysts have several choices when tasked with forecast adjustment for network services. The best choice is not obvious, and we further quantify the impact of different potential adjustments on revenue in a simulation study, following concepts outlined in Kimms and Müller-Bungart (2007).

In summary, this paper contributes (i) a method for identifying network legs that will benefit from joint outlier detection and (ii) a method to aggregate outlier detection across any number of legs to create a ranked alert list. To thoroughly evaluate the proposed approach, we offer (iii) wide-ranging simulation studies to benchmark the method’s outlier detection performance against alternative methods and to quantify the potential revenue improvements from forecast adjustments and (iv) a demonstration of applicability on empirical railway booking data from Deutsche Bahn.

2 Method

Several network products may rely on common resources when demand concerns multiple legs at once. In the transport example, even passengers that booked different itineraries often have to traverse the same legs. Therefore, specific legs share common outliers. For example, a sudden increase in demand from passengers travelling from one end of the network to attend an event at the other end would increase demand for each of the in-between legs. Neither considering each leg independently nor jointly considering the whole network will create the best results when the network spans multiple regions that differ strongly in demand (see Sect. 3.4 and Appendix 6). This raises the question of which legs to consider jointly for outlier detection.

To find an answer, in Sect. 2.1, we adapt a method by Zahn (1971) to cluster legs such that (i) legs in the same cluster share demand and can be considered jointly for outlier detection, and (ii) legs in different clusters experience distinct demand and should be considered separately. Subsequently, in Sect. 2.2, we suggest a method for analysing bookings within one such cluster. Based on this, we propose a method to rank departures by the severity of identified outliers.

2.1 Clustering legs using correlation-based minimum spanning trees

To cluster legs based on correlations in observed bookings, we first consider the network as a graph where nodes represent the stations and edges represent the legs of a journey. Figure 1a illustrates this on a simple network. To illustrate the concept, we rely on an example from the transport domain: In this example, two train lines (red and blue) intersect at two stations (B and C). The red train arrives at stations B and C before the blue train, which creates two possible transfer connections for passengers: (i) switch from red to blue at B and (ii) switch from red to blue at C. Transfers from the blue to red train are not feasible.

Fig. 1 Correlation-based minimum spanning tree clustering

Standard graph clustering algorithms, as exemplified in Schaeffer (2007), seek to cluster the nodes of the graph. In contrast, we wish to cluster similar edges, which correspond to legs in the railway example (Fig. 1a). Hence, we invert the graph to make existing clustering algorithms applicable. In this inversion (Fig. 1b), the directed edges become nodes, e.g. the edge from A to B becomes node AB. The inverted graph features an undirected edge between two nodes when:

  • both legs are in the same train line and share a common station, e.g. legs CD and DE are connected through station D, or

  • the legs are in different train lines but share a common transfer station where a connection is possible, e.g. leg FB (red line) and BC (blue line) are connected through station B. However, AB (blue line) and BC (red line) would not be connected by an edge as no connection can be made between them (as we have assumed the red train arrives at B and C before the blue train).

In theory, this transformation could also create edges between legs that share a common entry or exit node, e.g. FB (red line) and AB (blue line), or CG (red line) and CD (blue line). Given that such pairs of legs would never occur in the same itinerary, we would not expect demand outliers to affect both legs. Therefore, inserting an edge between them, potentially allowing them to be in the same cluster, is counter-intuitive. In addition, exploratory analyses of the empirical data found that correlations between these types of legs were very low across the entire network section.
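The inversion rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used for the paper: the line layout (red F-B-C-G, blue A-B-C-D-E) is inferred from the leg names in the text, the transfer set is an assumption about Fig. 1a, and the names `LINES`, `TRANSFERS` and `invert_graph` are hypothetical. Legs are stored as (line, origin, destination) so that the red and blue services on B to C remain distinct nodes.

```python
# Hypothetical line layout, inferred from the leg names used in the text.
LINES = {
    "red": ["F", "B", "C", "G"],
    "blue": ["A", "B", "C", "D", "E"],
}
# Feasible transfers as (arriving line, departing line, station).
# Assumption: only red-to-blue transfers are possible, at B and at C.
TRANSFERS = {("red", "blue", "B"), ("red", "blue", "C")}


def invert_graph(lines, transfers):
    """Return (nodes, edges) of the inverted graph.

    Each node is a leg (line, origin, destination). Each edge is an
    unordered pair of legs that can be travelled consecutively.
    """
    legs = [
        (name, a, b)
        for name, stops in lines.items()
        for a, b in zip(stops, stops[1:])
    ]
    edges = set()
    for first in legs:
        for second in legs:
            if first[2] != second[1]:
                continue  # the second leg must depart where the first arrives
            same_line = first[0] == second[0]
            can_transfer = (first[0], second[0], first[2]) in transfers
            if same_line or can_transfer:
                edges.add(frozenset((first, second)))
    return legs, edges


legs, edges = invert_graph(LINES, TRANSFERS)
```

Because an edge is only created when one leg departs from the station where the other arrives, pairs that merely share an entry or exit station, such as FB and AB, are never connected.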

The algorithm aims to assign those legs that experience similar bookings to the same cluster and those that experience dissimilar bookings to distinct clusters. A corresponding metric only needs to consider the similarity between adjacent legs that share a connecting station since edges do not otherwise exist in the inverted graph. We propose to quantify this similarity via the correlation between booking patterns.

To calculate correlations between booking patterns, we compute the functional dynamical correlation (Dubin and Müller 2005). Functional dynamical correlation is based on calculating scalar products between pairs of smoothed booking patterns; the appendix provides further details. We use the average of these paired correlations over time as the similarity measure between any two legs. Unlike more common statistical correlation measures, such as Pearson correlation, functional dynamical correlation does not assume a specific type of relationship between variables (e.g. linearity). It also accounts for the time dependency between observations within the booking horizon when the intervals between observations vary. For example, in the empirical RM data analysed in Sect. 5, the time between observations decreases as the departure date approaches. Further, alternative measures for calculating correlations from functional data (such as functional canonical correlation) often make restrictive assumptions, which real data do not fulfil (He et al. 2003).

We benchmark the clustering algorithm under alternative correlation measures in Appendix 8.

To represent the relationship between legs in the network, i.e. the nodes in the inverted graph, we attach weights to the edges in the inverted graph. These weights are interpreted as distances: A higher edge weight indicates that the connected nodes are more dissimilar. Therefore, an applicable weight function should be non-negative. Further, the weight function needs to ensure that any negatively correlated legs are marked as more dissimilar. Even though a negative correlation may imply that outlier demand jointly affects both legs, we expect it to affect negatively correlated legs differently. Therefore, these require different adjustments from an analyst and should be in different clusters. To satisfy these requirements, we define the edge weights as:

$$\begin{aligned} w_{(ij, jk)} = 1 - \rho (ij, jk), \end{aligned}$$

where \(\rho (ij, jk)\) is the correlation between bookings on legs ij and jk. Though the use of functional dynamical correlation as a measure of similarity between time series is not new, its application as an edge weight in a network setting, to our knowledge, is novel.

To allow for irregular cluster shapes, we recommend a minimum spanning tree (MST) algorithm (Prim 1957). For example, in Fig. 1b, a cluster may include AB and DE because they are in the same line, rather than clustering AB and FB. Minimum spanning tree approaches work well for clusters with irregular boundaries (Zahn 1971). Alternative clustering approaches (such as k-means) often assume a specific shape of clusters (spherical, for k-means). MST-based clustering approaches also do not assume that clusters are of similar sizes (Peter and Victor 2010). This makes them particularly suitable for transportation networks constructed as a series of interlocking lines, where the points of intersection are often not equally spaced. For example, MST-based approaches have previously been used in optimising layouts of railway networks (Liang et al. 2020).

A spanning tree of a graph is a subgraph that includes all vertices in the original graph and a minimum number of edges, such that the spanning tree is connected. Then, the MST is the spanning tree with the minimum summed edge weights (see Fig. 1c). Since the inverted graph is weighted, we use Prim’s algorithm (Prim 1957) to calculate the MST; Appendix 2 provides a detailed introduction. Any strictly increasing transformation of the weight function, \(w_{(ij, jk)}\), will produce an identical minimum spanning tree, because the MST depends only on the ordering of the edge weights.

There are two approaches to obtaining clusters from an MST: (i) pre-defining the number of clusters as k and removing the \(k-1\) edges with the highest weight; or (ii) setting a threshold for the edge weights and removing all edges with weights above some threshold, creating an emergent number of clusters. Here, we implement the threshold-based approach, ensuring that each cluster has the same minimum level of correlation. In contrast, setting the number of clusters in advance could result in very heterogeneous levels of correlation across clusters. Further, setting k too low may result in legs with dissimilar features being grouped together. We apply a threshold correlation of 0.5, the level at which legs are more correlated than they are not. This corresponds to a transformed edge weight of 0.5. In the example given in Fig. 1c, this means removing all edges with a weight above 0.5, resulting in the three clusters shown in Fig. 1d. The choice of this clustering threshold will impact the number of alert lists produced. Therefore, we recommend considering factors such as staffing resources and any current (informal) network clustering when choosing this threshold.
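The weighting, MST and threshold steps can be sketched as follows. This is an illustrative sketch only: the correlation values are invented, the node labels (prefix `r:` for red, `b:` for blue) and the function names are hypothetical, and the example yields two clusters rather than reproducing Fig. 1. The graph is assumed to be connected.

```python
import heapq

# Invented correlations between adjacent legs of a small inverted graph.
CORRELATIONS = {
    ("r:FB", "r:BC"): 0.9,
    ("r:BC", "r:CG"): 0.8,
    ("b:AB", "b:BC"): 0.7,
    ("b:BC", "b:CD"): 0.9,
    ("b:CD", "b:DE"): 0.6,
    ("r:FB", "b:BC"): 0.2,
    ("r:BC", "b:CD"): 0.3,
}


def prim_mst(nodes, weights):
    """Prim's algorithm; weights maps frozenset({u, v}) to a distance."""
    adjacency = {n: [] for n in nodes}
    for edge, w in weights.items():
        u, v = tuple(edge)
        adjacency[u].append((w, v))
        adjacency[v].append((w, u))
    start = nodes[0]
    visited = {start}
    heap = [(w, start, v) for w, v in adjacency[start]]
    heapq.heapify(heap)
    tree = []
    while heap and len(visited) < len(nodes):
        w, u, v = heapq.heappop(heap)
        if v in visited:
            continue  # this edge would close a cycle
        visited.add(v)
        tree.append((w, u, v))
        for w_next, x in adjacency[v]:
            if x not in visited:
                heapq.heappush(heap, (w_next, v, x))
    return tree


def threshold_clusters(nodes, tree, max_weight=0.5):
    """Drop tree edges heavier than max_weight; return the components."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for w, u, v in tree:
        if w <= max_weight:
            parent[find(u)] = find(v)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Edge weight = 1 - correlation, so dissimilar legs are further apart.
weights = {frozenset(pair): 1 - rho for pair, rho in CORRELATIONS.items()}
nodes = sorted({n for pair in CORRELATIONS for n in pair})
mst = prim_mst(nodes, weights)
clusters = threshold_clusters(nodes, mst, max_weight=0.5)
```

With these invented values, the MST drops the weakest link in the only cycle, and the threshold then separates the red legs from the blue legs.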

While the outlier detection procedure described next applies to individual clusters, it does not require a particular clustering approach. Hence, other implementations may employ alternative approaches, as reviewed in Schaeffer (2007). In particular, depending on the business context of the network service, alternative clustering algorithms may be more appropriate. The network topology should drive the choice of which clustering algorithm is most appropriate: That topology may differ, e.g. when considering airlines versus bike rentals versus railways, as discussed in Rennie et al. (2022). The choice of an MST-based approach, which often returns linear clusters, is appropriate for the railway application motivating the paper at hand, given the linear nature of the underlying network structure. We further evaluate the performance of MST clustering in this application in Appendix 8.

Furthermore, edge-based clustering could replace the graph inversion and node-based clustering presented here. However, the literature on edge-based clustering is far more limited, and such approaches tend to improve the visualisation of networks with a very high number of edges by reducing the number of edge crossings rather than grouping together the most similar edges (Qu et al. 2007). In contrast, inversion and node-based clustering aim to group network legs that exhibit the highest degree of similarity. However, alongside these advantages, there may be some drawbacks. The node-based approach requires deciding on criteria to select edges to include in the inverted graph.

2.2 Detecting outliers in clusters of legs

Given established clusters, we propose identifying demand outliers within each cluster and quantifying their severity to provide a ranked alert list of departures. The previously described clustering allows for processing the outlier detection in parallel for separate clusters, enabling efficient computing.

To identify which departures to include in the alert list, we consider the functional depth of the booking patterns, as in Rennie et al. (2021). This step could also rely on other measures of exceedance, including univariate 'threshold' approaches, which look at aggregated bookings and ignore the distribution of bookings over time. We propose to rely on functional depth, as previous work has found this to be the most effective as an outlier detection mechanism (Rennie et al. 2021).

To compute the functional depth, consider N departures observed over L legs. Let \(\varvec{y}_{nl} =\left( y_{nl}(t_1), \ldots , y_{nl}(t_{T}) \right)\) be the booking pattern for the \(n^{th}\) departure on leg l, observed over T booking intervals \(t_1,\ldots ,t_T\). Let \({\mathcal {Y}}_l\) be the set of N booking patterns for leg l. For each leg and departure, calculate the functional depth (\(d_{nl}\)) given the related booking patterns following the approach given in Hubert et al. (2012) and detailed in Appendix 3. The functional depths take on positive values, with smaller values of the depths relating to more outlying booking patterns.

For each leg l, we calculate a threshold for the functional depth using the approach of Febrero et al. (2008). This method (i) resamples the booking patterns with probability proportional to their functional depths (such that any outlying patterns are less likely to be resampled), (ii) smooths the resampled patterns, and (iii) sets the threshold \(C_l\) as the median of the \(1^{st}\) percentiles of the functional depths of the resampled patterns. Here, we use the \(1^{st}\) percentile of the depths as the default threshold, as this has been found to work well in practice (Febrero et al. 2008; Rennie et al. 2021). Booking patterns with a functional depth below the threshold \(C_l\) are classed as outliers. We explore alternative threshold choices in Appendix 2.

To create ranked alert lists, we first define \(z_{nl}\) to be the normalised difference between the functional depth and the threshold:

$$\begin{aligned} z_{nl} = \frac{C_l - \textrm{d}_{nl}}{C_l}. \end{aligned}$$

This transforms the depth measure \(\textrm{d}_{nl}\) into a measure of threshold exceedance. Values of \(z_{nl}\) greater than zero relate to booking patterns classified as outliers. Normalising by the threshold, \(C_l\), ensures that the values of \(z_{nl}\) are comparable between different legs.

Next, we define the sums of threshold exceedances across legs:

$$\begin{aligned} z_n = \sum _{l=1}^{L} z_{nl} \mathbbm {1}_{\{z_{nl} > 0\}}. \end{aligned}$$

We sum only those values of \(z_{nl}\) that are greater than zero to avoid outliers being masked when they occur only in a subset of legs. This sum implicitly accounts for both the size of an outlier—larger outliers further exceeding the threshold, resulting in larger values of \(z_{nl}\)—and for the number of legs where a departure is classified as an outlier (by summing a larger number of nonzero values). To provide an example, Fig. 2 shows those values of \(z_n\) that exceed zero for a four-leg section of the Deutsche Bahn network as discussed further in Sect. 5.2. These values of \(z_n\) correspond to departures where the booking pattern for at least one leg is identified as an outlier. In contrast, all other departures have no detected outliers in any leg such that \(z_n=0\).
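As a small worked illustration of the normalised exceedances \(z_{nl}\) and their sum \(z_n\), the following sketch uses invented depths and thresholds for three departures on two legs. The function name `exceedance_sums` and all numbers are hypothetical; computing the functional depths themselves is not covered here.

```python
def exceedance_sums(depths, thresholds):
    """Return z_n for each departure.

    depths[n][l] is the functional depth of departure n on leg l and
    thresholds[l] is the leg threshold C_l. Only positive z_nl values
    (depth below the threshold) contribute to the sum.
    """
    sums = []
    for row in depths:
        z_nl = [(c - d) / c for d, c in zip(row, thresholds)]
        sums.append(sum(z for z in z_nl if z > 0))
    return sums


thresholds = [0.2, 0.4]          # invented C_l for two legs
depths = [
    [0.5, 0.6],                  # regular on both legs
    [0.1, 0.6],                  # outlier on the first leg only
    [0.1, 0.2],                  # outlier on both legs
]
z = exceedance_sums(depths, thresholds)
```

The third departure receives a larger \(z_n\) than the second even though each individual exceedance is the same size, because it is flagged on more legs.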

Fig. 2 \(z_{n}\) as defined in Eq. (3) for a four-leg section of the Deutsche Bahn network

To create a ranked list of outlier departures, i.e. those with a nonzero-sum of threshold exceedances, we assign a severity \(\theta _n\). A higher value of \(\theta _n\) indicates that the departure is more likely to be affected by extreme outlier demand and hence should be targeted first by RM analysts.

To model threshold exceedances, we turn to extreme value theory (EVT)—a branch of statistics that deals with modelling rare events occurring in the tails of a distribution. Given that outliers are unusual events, which occur in the tails of distributions, EVT is a clear direction to turn to for modelling outliers—see Talagala et al. (2019). There are two common approaches to EVT: (i) block maxima, which examine the maximum value in evenly-spaced blocks of time, e.g. annual maxima, and (ii) peaks over the threshold, which examines all observations that exceed some threshold (Leadbetter 1991). The generalised Pareto distribution (GPD) is commonly used to model the tails of distributions in the peaks over threshold approach (Pickands 1975). Motivated thus, we fit a generalised Pareto distribution (GPD) to the sum of threshold exceedances given in equation (3). The GPD has three parameters with probability density function:

$$\begin{aligned} f(x\vert \mu , \sigma , \xi ) = \frac{1}{\sigma } \left( 1 + \frac{\xi (x-\mu )}{\sigma } \right) ^{\left( -\frac{1}{\xi } - 1\right) }, \end{aligned}$$


$$\begin{aligned} x \in {\left\{ \begin{array}{ll} {[}\mu , \infty ) &{} \xi \ge 0 \\ {[}\mu , \mu - \frac{\sigma }{\xi }] &{} \xi < 0. \end{array}\right. } \end{aligned}$$

Here, \(\mu\) specifies the location, \(\sigma\) the scale, and \(\xi\) the shape of the distribution. We fit the parameters using maximum likelihood estimation (Grimshaw 1993) via the R package POT (Ribatet and Dutang 2019). A kernel density estimate of the empirical distribution of \(z_n > 0\) from Fig. 2 is shown in Fig. 3a. The resulting fitted GPD is shown in Fig. 3b. As the further analysis in Appendix 17 shows, the GPD fit appears reasonable compared to the empirical distribution.

Fig. 3 Distribution of \(z_n\) values from Fig. 2

Two common issues arise in fitting GPDs: (i) the choice of threshold and (ii) the independence of the data points. When the threshold is too low, the assumption of a GPD no longer holds; when it is too high, there are too few data points to fit. We select a threshold of 0, i.e. we fit the GPD to values of \(z_n > 0\). Rather than change the threshold at the GPD level, we control the number of observations the GPD is fitted to by varying the percentile used for the individual leg thresholds, \(C_l\). We choose \(C_l\) as suggested by Febrero et al. (2008) and find that this choice works well and provides sufficient outlying points to fit a GPD in both simulated and empirical data.

To account for the second issue, applications of extreme value theory frequently first decluster the peaks over the threshold to ensure independence between observations (Fawcett and Walshaw 2007). To that end, the analysis may only consider the maximum of two peaks within a small time window. For transport departures, it is theoretically possible that observed outliers may be dependent; e.g. increased demand caused by Easter affects not only Easter Sunday but also the surrounding days. However, similar outliers may also result from independent events. As we aim to identify outlying departures rather than the underlying events, this argument causes us not to decluster here.

We define \(\theta _n\) as the non-exceedance probability given by the CDF of the GPD:

$$\begin{aligned} \theta _n = F_{(\mu , \sigma , \xi )}(z_n) = {\left\{ \begin{array}{ll} 1 - \left( 1 + \frac{\xi (z_n - \mu )}{\sigma } \right) ^{-\frac{1}{\xi }} &{} \xi \ne 0 \\ 1 - \exp \left( -\frac{(z_n - \mu )}{\sigma }\right) &{} \xi = 0 \end{array}\right. } \end{aligned}$$

Formally, \(\theta _n\) is the probability that, given an outlier occurs, the sum of threshold exceedances is no larger than \(z_n\). Thus, it is not the probability that a departure is an outlier. However, we use this non-exceedance probability as a measure of outlier severity on a scale of 0–1.
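The severity can be evaluated directly from the GPD distribution function. The sketch below is a plain-Python illustration, assuming the parameters have already been estimated elsewhere (the paper fits them with the R package POT); the function names and the parameter values in the example are placeholders, not fitted estimates.

```python
import math


def gpd_cdf(z, mu, sigma, xi):
    """Distribution function of the generalised Pareto distribution."""
    if z <= mu:
        return 0.0
    y = (z - mu) / sigma
    if xi == 0:
        return 1 - math.exp(-y)
    base = 1 + xi * y
    if base <= 0:
        return 1.0  # beyond the upper endpoint, which exists when xi < 0
    return 1 - base ** (-1 / xi)


def severity(z_n, mu, sigma, xi):
    """Severity theta_n: zero for regular departures, GPD CDF otherwise."""
    return gpd_cdf(z_n, mu, sigma, xi) if z_n > 0 else 0.0


# Placeholder parameters, chosen only to make the arithmetic easy to follow.
theta = severity(2.0, mu=0.0, sigma=1.0, xi=0.5)  # 1 - (1 + 0.5 * 2) ** -2
```

Since the distribution function is non-decreasing, a larger sum of threshold exceedances never receives a lower severity.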

Departures with functional depths that do not fall below the threshold on any legs carry a severity of zero, i.e. they are classified as regular departures. It is conceivable to estimate the uncertainty of \(\theta _n\) (Smith 1985) to determine further levels of criticality, e.g. if there are several departures with the same outlier severity, the one with the smallest uncertainty would be ranked first. However, given the continuous nature of the data, it is unlikely that multiple departures carry an identical severity. Hence, we leave uncertainty estimation to future research.

From the severity defined in equation (6), we construct a ranked alert list containing all departures with a nonzero outlier severity. Although functional depth could be directly used to construct the ranked alert list, computing the severity provides a measure of the difference between ranks and is more easily interpreted by analysts. The top 8 ranked outliers relating to Fig. 2 are shown in Table 1.

Table 1 Ranked alert list for cluster \(= \{AB, BC, CD, DE\}\)

In practice, RM analysts’ time and resources allow them to examine and adjust controls or forecasts only for a limited number of suspicious booking patterns. Those departures whose functional depth (i) falls below the threshold in only one leg or (ii) falls below the threshold only by a small margin have lower but strictly nonzero severity. These outliers are most likely false positives and potentially waste analysts’ time. Hence, we suggest limiting the length of the list in practice.

To limit the length of the alert list, we might (i) only include departures if their severity is above some threshold or (ii) set a maximum length. Since we wish to control the number of alerts an analyst will receive, we analyse outlier detection performance as dependent on the maximum length of the alert list. Recall that we classify departures as outliers if and only if their outlier severity exceeds zero. Therefore, if the required length of the alert list exceeds the number of identified outliers, we do not include further departures. Appendix 7 features further results on the outlier detection performance when varying the outlier severity threshold.
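A capped alert list of this kind can be sketched as below. The departure labels and severities are invented and the function name `alert_list` is hypothetical; the point is only that departures with zero severity are never used to pad the list.

```python
def alert_list(severities, max_length):
    """Rank departures with nonzero severity and keep at most max_length."""
    flagged = [(dep, s) for dep, s in severities.items() if s > 0]
    flagged.sort(key=lambda item: item[1], reverse=True)
    return flagged[:max_length]


# Invented severities for five departures.
severities = {"dep1": 0.0, "dep2": 0.93, "dep3": 0.41, "dep4": 0.0, "dep5": 0.78}
top_two = alert_list(severities, max_length=2)
everything = alert_list(severities, max_length=10)
```

Requesting ten alerts here still returns only the three departures classified as outliers.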

3 Outlier detection performance

We first implement a simulation study to evaluate the outlier detection performance given known outliers. By varying the demand for itineraries in one cluster, we create outliers that are observable on both the leg and network levels.

The simulation models a network consisting of five stations and four legs, as shown in Fig. 4, mirroring the structure of an empirical railway network cutout. The network includes 10 possible itineraries represented by \({\mathcal{O}} = \{\mathrm{AB, AC, AD, AE, BC, BD, BE, CD, CE, DE}\}\). On each itinerary, the firm offers seven fare classes. In this model, a fare class describes a particular price or fare associated with booking a ticket to travel the itinerary in that class. There are no additional restrictions differentiating classes.

Fig. 4 Four-leg cluster, dotted lines indicate 10 possible itineraries

3.1 Demand settings

Extending the demand model described in Rennie et al. (2021) to the network setting, the simulation generates booking requests per customer type i according to a non-homogeneous Poisson process, where the arrival rate per itinerary o, \(\lambda _{i,o}(t)\), at time \(t\), is given by:

$$\begin{aligned} \lambda _{i,o}(t)\vert (D_{o}=\textrm{d}_{o}) = \textrm{d}_{o} \times \phi _{io} \frac{t^{a_{io}-1}(1-t)^{b_{io}-1}}{B(a_{io},b_{io})}. \end{aligned}$$

Here, \(\phi _{io}\) is the fraction of customers of type i and \(D_{o} \sim \text{ Gamma }(\alpha _{o},\beta _{o})\) with probability density function:

$$\begin{aligned} f(\textrm{d}_{o}\vert \alpha _{o}, \beta _{o}) = \frac{\beta _{o}^{\alpha _{o}}}{\Gamma (\alpha _{o})} \textrm{d}_{o}^{\alpha _{o} -1}e^{-\beta _{o} \textrm{d}_{o}}, \end{aligned}$$

where \(a_{io}\) and \(b_{io}\) are the parameters of a Beta distribution that defines how customers arrive over time. We generate demand over a horizon of 3600 time slices to ensure \(\lambda _{i,o}(t) < 1\). This level of detail is required to accurately parameterise the dynamic program for bid price control. The resulting bookings are aggregated into 18 booking intervals.
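The arrival process can be illustrated with a short standard-library sketch. Several choices here are assumptions rather than details from the paper: the horizon is rescaled to \(t \in (0,1)\), the rate is evaluated at slice midpoints and divided by the number of slices, at most one request is drawn per slice, the 18 booking intervals are taken to be equally wide, and all parameter values and function names are invented. The sketch generates booking requests only and does not model fare class availability or purchase decisions.

```python
import math
import random


def beta_pdf(t, a, b):
    """Density of a Beta(a, b) distribution at t in (0, 1)."""
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return t ** (a - 1) * (1 - t) ** (b - 1) / norm


def slice_rates(d, phi, a, b, n_slices=3600):
    """Expected requests per time slice for one customer type and itinerary."""
    return [
        d * phi * beta_pdf((k + 0.5) / n_slices, a, b) / n_slices
        for k in range(n_slices)
    ]


def simulate_requests(rates, n_intervals=18, rng=random):
    """Draw at most one request per slice; return cumulative interval counts."""
    per_interval = len(rates) // n_intervals
    counts = [0] * n_intervals
    for k, rate in enumerate(rates):
        if rng.random() < rate:
            counts[min(k // per_interval, n_intervals - 1)] += 1
    cumulative, total = [], 0
    for c in counts:
        total += c
        cumulative.append(total)
    return cumulative


rng = random.Random(42)
alpha, beta = 100.0, 1.0                   # invented Gamma shape and rate
# random.gammavariate takes a scale parameter, hence 1 / beta for a rate.
d = rng.gammavariate(alpha, 1 / beta)
rates = slice_rates(d, phi=0.5, a=2.0, b=3.0)
pattern = simulate_requests(rates, rng=rng)
```

Because the Beta density integrates to one, the slice rates sum to approximately \(\textrm{d}_{o} \times \phi _{io}\), the expected number of requests from that customer type.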

As in Rennie et al. (2021), we consider differentiated demand from two customer types represented by the set \({\mathcal {I}} = \{1,2\}\). We assume that customers book the cheapest available fare class and differ in price sensitivity. We define \(p_{ijo}\) as the probability that a customer of type i pays up to fare class j on itinerary o. By combining demand from two customer types that differ in price sensitivity with offers that depend on the current set of offered classes, we mimic a realistic price effect: Offer prices result from the cheapest class currently offered by the firm, as customers will buy the cheapest available class. When a customer’s willingness to pay does not equal or exceed the price of the cheapest available class, they do not buy; hence, their price sensitivity translates to decreased demand. Note that the price–demand response depends on the itinerary and time in the booking horizon.

Combining this demand model with the given network creates 210 demand parameters. Table 2 provides a full list of parameter values and interpretations of each parameter. We set the parameters to mirror common RM assumptions (Weatherford and Bodily 1992): (i) valuable customers from type 1 book later than customers from type 2, (ii) customers book earlier for longer journeys, and (iii) customers are willing to pay a higher fare class if they are travelling further. Most passengers book tickets boarding at A and leaving at E; this ensures the correlation between the legs exceeds 0.5 and guarantees that the legs are correctly modelled in the same cluster as detailed in Appendix 19.

We validate that the functional dynamical correlation between the four legs for simulated data is comparable to empirical railway data as detailed in Appendix 19. We generate all regular demand based on these parameters.

The simulation excludes trend and seasonality to evaluate outlier detection approaches in a best-case scenario. In other words, if an algorithm fails on observations from stationary demand, it will likely not perform better given more demand variability. However, additional results based on simulation data that do feature seasonality can be found in Appendix 10.

3.2 Outlier generation and evaluation

We generate demand-volume outliers by changing the gamma distribution parameters that govern the total demand level according to equations (7) and (8). Previous work found that the proportion of outliers had little effect on outlier detection performance in the single-leg case (Rennie et al. 2021). Therefore, we generate booking patterns for 500 departures per demand setting, with 1% of departures experiencing outlier demand. That is, we generate 495 departures from the regular demand distribution and five outliers from a set of twelve outlier distributions where the mean has shifted by \(\pm 10\%\), \(\pm 20\%\), \(\pm 30\%\), \(\pm 40\%\), \(\pm 50\%\), and \(\pm 60\%\). For every shift in the mean, we reduce the variance of the outlier demand distribution by \(80\%\). This still results in an overall increase in the variance of total demand in the presence of outliers but also ensures that we sample sufficiently outlying demand values. Outliers may also occur due to factors such as changes in arrival times or changes in customers’ willingness to pay. Rennie et al. (2021) provide results on how the performance of functional depth varies under these different types of outliers. Here, we focus on the different types of outliers caused by varying network effects. In all cases, we apply the outlier detection procedure to the constrained demand, i.e. directly to the booking patterns without applying any unconstraining approach first. The problem of unconstraining is one of the major challenges of demand forecasting for revenue management and is beyond the scope of this paper.
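To make the parameter change concrete, the sketch below derives shifted Gamma parameters by moment matching. This is an assumption for illustration: it reads the 80% reduction as relative to the regular variance and uses the identities mean \(= \alpha /\beta\) and variance \(= \alpha /\beta ^2\); the function name and the example values are hypothetical and need not coincide with equations (7) and (8).

```python
def outlier_gamma_params(alpha, beta, mean_shift, variance_reduction=0.8):
    """Return (alpha, beta) of a Gamma(shape, rate) distribution whose mean
    is shifted by `mean_shift` (e.g. 0.2 for +20%) and whose variance is
    reduced by `variance_reduction` relative to the regular distribution."""
    mean = alpha / beta
    variance = alpha / beta ** 2
    new_mean = mean * (1.0 + mean_shift)
    new_variance = variance * (1.0 - variance_reduction)
    new_beta = new_mean / new_variance   # rate = mean / variance
    new_alpha = new_mean * new_beta      # shape = mean^2 / variance
    return new_alpha, new_beta

# Hypothetical regular demand: mean 100, variance 200.
shifted_alpha, shifted_beta = outlier_gamma_params(50.0, 0.5, mean_shift=0.2)
```

For these illustrative values, the shifted distribution has mean 120 and variance 40, so its draws concentrate well above the regular mean of 100.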

We differentiate outlier scenarios in terms of the affected network components. Firstly, we evaluate a scenario where outlier demand affects all network itineraries. We consider the case where each outlier is randomly drawn from one of the twelve outlier distributions, resulting in outliers from a mixture of different distributions. This lets us test whether the ranking of the alert list mirrors the outliers’ underlying degree of demand deviation. Then, we consider each of the twelve outlier distributions in isolation to assess the detection sensitivity. Secondly, we evaluate a scenario where outliers only affect a single itinerary. This evaluates the benefits of clustering multiple legs.

In Appendix 6, we consider the practically relevant case of outliers affecting a subset of itineraries and provide further details on all simulation experiments.

Each combination of outcomes can be classified into one of four categories: (i) assigning a nonzero outlier severity to a genuine outlier creates a true positive (TP); (ii) assigning a zero outlier severity to a regular observation creates a true negative (TN); (iii) assigning a nonzero outlier severity to a regular observation creates a false positive (FP); and (iv) assigning a zero outlier severity to a genuine outlier creates a false negative (FN). This classification enables us to compute the true-positive rate (TPR) for the top R ranked departures in the alert list:

$${\text{TPR}}_{{\text{R}}} = \frac{{{\text{TP}}_{{\text{R}}} }}{{{\text{TP + FN}}}},$$

where \(\mathrm{TP_R}\) is the number of true positives in the top R departures. The true-positive rate lies between 0 and 1, where 1 means all genuine outliers were identified. We evaluate performance across 1000 stochastic simulations.
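The measure can be computed directly from a ranked alert list, as in the following minimal sketch. The function name and the departure identifiers are hypothetical; the denominator TP + FN equals the total number of genuine outliers.

```python
def tpr_at_r(ranked_departures, genuine_outliers, r):
    """True-positive rate among the top-r entries of a ranked alert list.

    ranked_departures: departure ids ordered from most to least severe.
    genuine_outliers: ids of all genuine outliers (TP + FN in total).
    """
    genuine = set(genuine_outliers)
    if not genuine:
        raise ValueError("at least one genuine outlier is required")
    tp_r = sum(1 for dep in ranked_departures[:r] if dep in genuine)
    return tp_r / len(genuine)

# Five genuine outliers; the alert list ranks one of them first.
alert_list = [17, 4, 230, 88, 301, 9]
genuine = [17, 230, 301, 412, 499]
```

With five genuine outliers, `tpr_at_r(alert_list, genuine, 1)` cannot exceed 0.2, which is why a value near 0.2 at list length 1 indicates that the top-ranked departure is almost always a genuine outlier.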

In an ideal setting, the alert list should feature, from top to bottom, large outliers and, subsequently, smaller outliers. Therefore, we also use the distribution of outliers within the ranked alert list to evaluate how well the method ranks the most critical outliers.

3.3 Benchmarked outlier detection approaches

For benchmarking, we term the newly proposed approach FD+Agg and compare it to two alternatives from the literature: Principal component analysis combined with high-density regions (PCA+HDR) as inspired by Hyndman et al. (2016), and the leg-based functional depth analysis as proposed in Rennie et al. (2021).

3.3.1 Comparison with PCA+HDR

This benchmark (i) computes features (e.g. mean, variance, and curvature) of the booking patterns for the total demand in a cluster; (ii) uses PCA (Yang and Shahabi 2004) to identify the first two principal components from the features; and (iii) uses HDR, a density-based approach (Hyndman 1996), to find the \(\nu\) points with the lowest density in the first two principal components. These points are classified as outliers. Extended details of the method, including the list of features, can be found in Appendix 6. This method provides an ordering of the outliers but not a severity measure, as illustrated by Fig. 5.

3.3.2 Comparison with non-ranked, single-leg approaches

To highlight two critical features of FD+Agg, we benchmark (i) the use of severity measures to rank outliers and (ii) the inclusion of network effects. To isolate the effects of each of these features, we perform two separate benchmark tests:

We evaluate the effect of ranking outliers by measuring the increase in precision when ranking outliers. For example, we consider the precision in the top 5 ranked departures versus 5 randomly chosen departures with nonzero outlier probabilities (i.e. as in Rennie et al. (2021)). The change in precision when considering the top R departures, \(\Delta (Precision)_{R}\), is given by:

$$\Delta ({\text{Precision}})_{R} = \frac{{{\text{TP}}_{{\text{R}}} }}{{{\text{TP}}_{{\text{R}}} {\text{ + FP}}_{{\text{R}}} }} - \frac{{{\text{TP}}_{{{\text{R}}({\text{random}})}} }}{{{\text{TP}}_{{{\text{R}}({\text{random}})}} {\text{ + FP}}_{{{\text{R}}({\text{random}})}} }},$$

where \(\mathrm{TP_{R(random)}}\) is the number of true positives in a random selection of R departures with nonzero severity, and \(\mathrm{FP_{R(random)}}\) is defined analogously for false positives.
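A minimal sketch of this comparison follows. As a simplifying assumption, it replaces the single random draw of R departures by its expectation: when R departures are drawn uniformly without replacement from all departures with nonzero severity, the expected precision equals the share of genuine outliers among them. The function name and the example data are hypothetical.

```python
def precision_gain(ranked_flagged, genuine_outliers, r):
    """Precision in the top-r ranked departures minus the expected precision
    of r departures drawn at random from all flagged departures.

    ranked_flagged: departures with nonzero severity, most severe first.
    """
    genuine = set(genuine_outliers)
    top = ranked_flagged[:r]
    if not top:
        raise ValueError("the alert list must contain at least one departure")
    precision_ranked = sum(dep in genuine for dep in top) / len(top)
    precision_random = (sum(dep in genuine for dep in ranked_flagged)
                        / len(ranked_flagged))
    return precision_ranked - precision_random

# Ten flagged departures, five of which are genuine and ranked first.
flagged = [7, 3, 9, 1, 4, 8, 2, 6, 5, 0]
genuine = {7, 3, 9, 1, 4}
```

In this example, the gain is 0.5 for R = 5 and vanishes for R = 10, since a list that contains every flagged departure cannot benefit from its ordering.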

We quantify the value of accounting for network effects by computing ranked alert lists for each leg in isolation. We then compare the true-positive rates to the aggregated, network-driven approach presented in this paper.

3.4 Detecting outliers in multiple legs

As a first experiment, we consider the scenario where outlier demand equally affects all itineraries and legs within the cluster. For this scenario, Fig. 5a illustrates how the true-positive rate (TPR) increases when ranking outliers for different lengths of the alert list. The red line indicates the number of genuine outliers. The true-positive rates for our method (denoted as FD+Agg) are promising, with a TPR of around 0.2 for a list length of 1. Since there are five genuine outliers, this indicates that a genuine outlier is almost always ranked top. Results under different functional depth thresholds are given in Appendix 2.

Fig. 5

Performance and benchmark comparison with PCA+HDR for demand-volume outliers in all itineraries, showing improved performance

3.4.1 PCA+HDR benchmark results

The PCA+HDR approach requires a given number of outliers to detect, \(\nu\), as input. Therefore, we compare the performance of the benchmark method under different choices of \(\nu\) to FD+Agg.

Figure 5b shows that the true-positive rate achieved by FD+Agg consistently exceeds that achieved by PCA+HDR. To achieve the same level of the true-positive rate, PCA+HDR would need to classify around 250 departures (i.e. 50%) as outliers. In comparison, FD+Agg achieves this rate starting at about 30 classified outliers. We consider this a successful validation of the effect of ranking outliers in FD+Agg. Appendix 1 lists these results in tabular format.

Figure 5c shows the distribution of each outlier magnitude in the alert lists. Under FD+Agg, the modes of the distributions generally fall where they should, as larger outliers are ranked higher. The smaller variance in the ranking of the larger magnitude outliers indicates that they are easier to detect. The higher variance of the medium-sized outliers can be explained as the ranking of a medium-sized outlier is dependent on which other types of outliers occur: If there is a large and a medium outlier, the medium outlier is ranked lower; if there is a small and a medium outlier, the medium outlier is ranked higher. The distribution of outliers detected by PCA+HDR, shown in Fig. 5d, also has the modes in the correct order. However, there is much more overlap between the distributions, showing its inability to correctly rank the outliers.

3.4.2 Comparison with non-ranked approach

Figure 6a highlights how the precision improves when ranking outliers instead of listing them in random order. Ranking particularly improves precision when the alert list covers only a small number of departures. As domain experts indicate that analysts cannot target more than 1% of departures, ranking focuses resources and thereby provides large benefits in practice. Nevertheless, Fig. 6a (when contrasted with Fig. 5a) also highlights the trade-off between reducing the number of false alerts and identifying all outliers. A shorter length of alert list increases precision but reduces the true-positive rate.

Fig. 6

Change in precision from ranking detected outliers in FD+Agg as dependent on the length of the alert list

The increase in precision from applying our method compared to PCA+HDR is similar to the increase in precision from the inclusion of the ranking (see Fig. 6b). This suggests that PCA+HDR performs reasonably well in terms of outlier detection, but poorly in terms of ranking the outliers.

3.4.3 Comparison with single-leg approach

Figure 7 shows the true-positive rate when a ranked alert list is computed for each leg in isolation versus in the proposed aggregated manner. Here, we consider outlier demand generated by a 50% increase in the affected legs as an illustrative example. We analyse detection performance by breaking down results in terms of which itinerary the outlier demand is generated in. We show only the results relating to itineraries AB, AC, AD, and AE. Figure 26 in Appendix 5 details results for the further itineraries yielding similar conclusions.

For results when outlier demand is generated across combinations of itineraries, refer to Appendix 6.

Fig. 7

True-positive rate for single itinerary outliers when applying FD+Agg versus detection on isolated legs

In all cases, the true-positive rate for clusters is higher than in any of the individual legs. This is because when considering the leg’s bookings in isolation under outlier demand that affects multiple legs, the noise from other itineraries prevents detecting the outlier in every leg. However, clustering increases the number of detected genuine outliers.

Aggregation is most beneficial when the outlier demand affects the most legs. In our example, this applies when itinerary AE experiences outlier demand, as shown in Fig. 7a. The lower true-positive rates in legs AB and DE result because different combinations of itineraries also utilise these legs. The aggregation is less beneficial when outlier demand affects an itinerary consisting of only one or two legs since we aggregate the analysis across legs that are actually not affected by outlier demand. However, there is a modest gain in true-positive rate even in this case—compare Fig. 7(c). This is due to the knock-on effects of decreased capacity on the affected legs, impacting the bid prices for any itineraries which include these legs. For some lengths of the alert list, the leg-level true-positive rates are higher than the aggregated approach, due to false positives from unaffected legs being included in the list. However, even for itinerary AB (Fig. 7d), where false positives from unaffected legs are most likely, the difference is small and cancelled out by the overall increase in true-positive rate.

3.4.4 Sensitivity to different magnitudes of outliers

To better understand outlier detection performance, we break down the results by the magnitude of outliers in Fig. 8.

Fig. 8

Sensitivity of true-positive rate from FD+Agg under different magnitudes of homogeneous demand-volume outliers

When outliers result from minor changes in demand levels, they are difficult to detect, resulting in low true-positive rates. Given the significant overlap between the distribution of outlier demand with a 10% change in magnitude and that of regular demand, this is to be expected. Therefore, 10% demand changes effectively provide a lower bound on how big an outlier needs to be in order to be detected.

Fig. 9

Sensitivity of precision from FD+Agg under different magnitudes of homogeneous demand-volume outliers

As the magnitude of the outliers increases, they become easier to detect and true-positive rates are higher, with peak rates reached with shorter alert lists. Thus, genuine outliers are more likely to be ranked higher when they are caused by larger demand changes. For demand decreases of at least 50%, the true-positive rate is very close to the optimal detection rate. Negative demand outliers are slightly easier to detect than positive demand outliers, meaning shorter alert lists are required. This is due to the demand censoring imposed by the booking controls and capacity restrictions.

Figure 9 shows the precision gap over randomly ordered lists. Once more, larger magnitude outliers result in larger precision improvements from ranking, while detecting minor outliers gains little over random selection. Similarly, we observe that detecting negative demand outliers gains slightly more precision in comparison with detecting positive outliers of the same magnitude. Additional results regarding false discovery rates are available in the appendix.

4 Simulation study: forecast adjustments

To evaluate the implications of adjusting the demand forecast for further planning steps, we simulate network demand and the optimisation of offered fare classes over the booking horizon. We list and explain all parameters determining the settings in the simulation study in Appendix 7. In this section, we first detail how the simulated RM system uses the demand forecast to compute revenue-optimal offers based on bid prices. In that, it follows a widely implemented industry standard. Subsequently, we describe alternative strategies that analysts may apply to adjust demand forecasts based on identified outliers. Finally, by comparing revenue gained from offers based on different adjusted demand forecasts under the same simulated outlier demand, we highlight the effects of adjustments as dependent on outlier scenarios.

4.1 Network revenue management system

The simulated RM system controls the offered set of fare classes per itinerary to optimise expected revenue. To that end, it implements a dynamic program to compute bid prices per leg and sums them up per itinerary following the methodology described in Strauss et al. (2018) and detailed in the appendix. To test the sensitivity of results with regard to the revenue optimisation, we compared two industry standards, the leg-based EMSR heuristic introduced in Belobaba (1987) and dynamic programming, in initial simulation studies not further documented here. The results showed that, for the given demand model, the choice of optimisation approach had little effect on the quality of the outlier detection.

The bid price indicates the marginal difference between the value of selling a seat in the current time period and that of reserving it to sell in a future time period. The RM system only offers fare classes where the revenue from a booking exceeds the bid price. Thus, as an RM term, bid prices do not denote the customer’s bid but indicate the minimum price a fare class must carry to be included in the offer set. From those classes in the offer set, customers only consider the cheapest offer. Bid prices depend on the time until departure, unsold capacity, and expected demand. Note that in the examples given here, we consider a single capacity per leg and do not differentiate, for example, between 1st and 2nd class compartments with separate capacities.
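The interplay of bid prices, offer sets, and customer choice can be sketched as follows. This is an illustrative sketch only: the function names, fare values, and bid prices are hypothetical, and the bid prices are taken as given rather than computed by the dynamic program.

```python
def offered_classes(fares, leg_bid_prices):
    """Fare classes whose fare exceeds the itinerary bid price, i.e. the sum
    of the bid prices of all legs that the itinerary uses."""
    itinerary_bid_price = sum(leg_bid_prices)
    return {cls: fare for cls, fare in fares.items()
            if fare > itinerary_bid_price}

def booking_decision(fares, leg_bid_prices, willingness_to_pay):
    """Return the booked fare class, or None if the customer does not buy.
    Customers only consider the cheapest class in the offer set."""
    offers = offered_classes(fares, leg_bid_prices)
    if not offers:
        return None
    cheapest = min(offers, key=offers.get)
    return cheapest if willingness_to_pay >= offers[cheapest] else None

# Hypothetical fares in Euros and bid prices for a two-leg itinerary.
fares = {"A": 80.0, "O": 50.0, "M": 20.0}
leg_bid_prices = [15.0, 10.0]
```

Here, the itinerary bid price of 25 Euros removes class M from the offer set, so a customer willing to pay 60 Euros books class O, while a customer willing to pay 40 Euros does not book at all.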

Booking patterns result as customers arrive and decide to book one of the offered fare classes. The firm does not report booking patterns for each individual itinerary, but only records them on the leg level.

The dynamic programme relies on a given set of expected demand arrival rates per leg l, fare class j, and time slice t of the booking horizon. In the simulation, we derive expected demand arrival rates from our knowledge of the underlying demand model. Arrival rates for each leg l and fare class j are given as

$$\begin{aligned} {\hat{\Lambda }}_{j,l}(t) = \sum _{o \in {\mathcal {O}}_l} \sum _{i \in {\mathcal {I}}} p_{i,j,o} \, \lambda _{i,o}(t), \end{aligned}$$

where \(\lambda _{i,o}(t)\) is the arrival rate of customers of type i requesting itinerary o, and \({\mathcal {O}}_l\) is the set of itineraries which include leg l. This creates an artificially accurate demand forecast. Deriving the demand forecast from the actual demand parameter values ensures that the estimation of revenue loss caused by undetected outliers is not affected by flawed forecasts (see Sect. 4.3). In practice, demand parameter values are not known but are estimated based on previously observed demand and time-series forecasting. A recent survey of related research contributions can be found in Banerjee et al. (2020), while Fiig et al. (2019) represent an example of the ongoing discussion on the link between forecast accuracy and RM performance.
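The aggregation from itinerary-level to leg-level arrival rates can be sketched as below. The data structures (a dictionary mapping itineraries to legs, a dictionary of purchase probabilities, and a rate function) and all numerical values are hypothetical choices made for illustration.

```python
def leg_arrival_rate(leg, fare_class, t, itineraries, p, lam,
                     customer_types=(1, 2)):
    """Expected arrival rate on `leg` for `fare_class` at time t.

    itineraries: dict mapping each itinerary o to the legs it includes.
    p: dict mapping (i, j, o) to the probability that a type-i customer
       pays up to fare class j on itinerary o.
    lam: function (i, o, t) returning the arrival rate of type-i customers
         requesting itinerary o at time t.
    """
    return sum(p[(i, fare_class, o)] * lam(i, o, t)
               for o, legs in itineraries.items() if leg in legs
               for i in customer_types)

# Toy network with two legs and three itineraries.
itineraries = {"AB": ("AB",), "BC": ("BC",), "AC": ("AB", "BC")}
p = {(i, "M", o): 0.5 for i in (1, 2) for o in itineraries}
constant_rate = lambda i, o, t: 0.1
rate_ab = leg_arrival_rate("AB", "M", 0.3, itineraries, p, constant_rate)
```

In the toy network, leg AB is used by itineraries AB and AC, so its rate sums four terms of 0.05 each (two itineraries times two customer types).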

4.2 Forecast adjustments for outlier demand

One aim of identifying outlier demand in booking patterns is to support analyst adjustments in RM systems. Without such adjustments, offers would be optimised for a regular demand forecast and thereby not be fit for maximising revenue under outlier demand. This raises the difficulty of predicting the consequences of analyst adjustments throughout the network. As a step in this direction, we analyse a best-case scenario, assuming that the adjustment is made with foresight before the start of the booking horizon. We compare the revenue under three different adjustments:

  • Adjustment 1 (conservative) Adjust only forecasts of affected single-leg itineraries. E.g. for an outlier creating additional demand for itinerary AC, increase the forecasts of itineraries AB and BC.

  • Adjustment 2 (aggressive) Adjust forecasts of all itineraries that include at least one of the affected legs. E.g. for additional demand for itinerary AC, adjust all itineraries, including either leg AB or leg BC—i.e. itineraries AB, AC, AD, AE, BC, BD, and BE.

  • Adjustment 3 (balanced) Adjust forecasts of affected single-leg itineraries and the cluster-spanning itinerary; in this case, AE. E.g. for additional demand for itinerary AC, adjust itineraries AB, BC, and AE. The motivation for adjusting AE (ahead of other itineraries) is that, in general, this will be the most popular itinerary in the cluster.
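The three adjustment rules listed above can be written down for the four-leg line network A to E as in the following sketch. The function and strategy names are hypothetical labels introduced for illustration.

```python
from itertools import combinations

STATIONS = "ABCDE"  # line network with legs AB, BC, CD, and DE

def legs_of(itinerary):
    """Legs traversed by an itinerary such as 'AC' -> ['AB', 'BC']."""
    start = STATIONS.index(itinerary[0])
    end = STATIONS.index(itinerary[1])
    return [STATIONS[k] + STATIONS[k + 1] for k in range(start, end)]

ALL_ITINERARIES = [a + b for a, b in combinations(STATIONS, 2)]  # ten in total

def adjusted_itineraries(outlier_itinerary, strategy):
    """Itineraries whose forecasts are adjusted under each strategy."""
    affected = set(legs_of(outlier_itinerary))
    if strategy == "conservative":   # Adjustment 1: single-leg itineraries
        return sorted(affected)
    if strategy == "aggressive":     # Adjustment 2: any shared leg
        return sorted(o for o in ALL_ITINERARIES if affected & set(legs_of(o)))
    if strategy == "balanced":       # Adjustment 3: legs plus spanning itinerary
        return sorted(affected | {STATIONS[0] + STATIONS[-1]})
    raise ValueError(f"unknown strategy: {strategy}")
```

For an outlier in itinerary AC, the sketch reproduces the examples given in the list: AB and BC under Adjustment 1; AB, AC, AD, AE, BC, BD, and BE under Adjustment 2; and AB, BC, and AE under Adjustment 3.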

These three adjustments are not the only choices available to analysts. However, they represent options that stretch across the spectrum of how fully network effects should be considered. Adjustments 1 (conservative, leg-based adjustments only) and 2 (aggressive, all potential network effects) are the two extremes. Adjustment 3 (balanced) is a compromise, which is more conservative than Adjustment 2 but still identifies the itinerary most likely to be the source of outlier demand. Further options would be to include more than just the cluster-spanning itinerary in an alternative to Adjustment 3, but this leaves another choice of which itineraries to prioritise. As a lower bound, we compute the revenue when no adjustment is made. As an upper bound, we implement an oracle adjustment, i.e. only adjusting the forecasts of affected itineraries. We compare the revenue as the level of outlier demand ranges from -60% to +60% of the average leg demand.

4.3 Experimental results: revenue benefits

Figure 10 shows the revenue generated by outlier demand for each of the three adjustments. We show the results for four of the ten itineraries contained within the four legs shown in Fig. 4. The results for the other six itineraries are similar. Appendix 12 further details these results as well as results on adjustments after outlier detection.

Fig. 10

Revenue under different forecast adjustments; the subtitle indicates the actual outlier source

When outlier demand affects all four legs in the cluster (Fig. 10a), any type of adjustment is always better than no adjustment. Besides the oracle, the best choice is Adjustment 3, i.e. the balanced approach, which adjusts the forecasts of the cluster-spanning itinerary and the individual leg. Adjustment 3 is able to obtain, on average, 87% of the additional revenue gained under the oracle adjustment. Similar results are obtained when the outlier demand affects three legs (Fig. 10b).

When outlier demand affects only a single-leg itinerary (Fig. 10d), the conservative Adjustment 1 and the oracle adjustment coincide. The aggressive Adjustment 2 yields less revenue than no adjustment. For example, although leg AB is correctly adjusted, the erroneous adjustment to itineraries AC, AD, and AE results in incorrect forecasts for legs BC, CD, and DE. The asymmetry between adjustment to positive and negative outlier demand is due to the level of demand being bounded from below by 0. Similar results emerge when the outlier affects only two of the legs (Fig. 10c), though the negative consequences of over-adjusting all potentially affected itineraries are less severe, as this causes fewer superfluous adjustments.

The negative impact of adjusting unaffected itineraries highlights the importance of correctly clustering legs ahead of outlier detection. The closer the outlier demand itinerary is to the cluster-spanning itinerary, the less risky it is to adjust all affected itineraries within a cluster, and the more benefit can be gained from doing so. From a managerial perspective, the best adjustment (other than the oracle) depends on the firm’s objective. To maximise revenue when the most common outlier (e.g. itinerary AE) occurs, the balanced Adjustment 3 is preferable. Conversely, if the objective is to minimise risk to revenue even in the more unlikely scenarios (e.g. an outlier in itinerary AB), conservative Adjustment 1 is preferable. Overall, however, there are clear benefits from forecast adjustment.

5 Empirical study

To demonstrate the practical applicability of the proposed clustering and outlier detection, we apply it to a set of empirical data obtained from Deutsche Bahn. This data set features only bookings of the 2nd class compartment, such that all bookings on one leg require capacity from the same compartment. The Deutsche Bahn long-distance network consists of over 1000 train stations, letting the provider offer more than 110,000 direct origin–destination combinations. The numbers grow further when accounting for alternative transfer itineraries and multiple daily departures. Figure 11 shows the empirical distribution of the number of legs included in itineraries that passengers booked in November 2019. Only 7% of passengers booked single-leg itineraries, whereas almost half of all booked itineraries span five or more legs.

Fig. 11

Distribution of the number of legs per booked itinerary from Deutsche Bahn data

5.1 Clustering legs in the Deutsche Bahn network

5.1.1 Small network subsection

First, we consider a section of the Deutsche Bahn railway network that consists of two intersecting train lines over a total of 27 stations and 28 legs—see Fig. 12. The red train arrives at the connecting stations before the blue train. Hence, the network offers three transfer connections: changing from red to blue at either Fulda, Kassel-Wilhelmshöhe, or Göttingen. This creates 240 potential travel itineraries. For each leg in this network section, Deutsche Bahn records 359 booking patterns for departures between December 2018 and December 2019. Each booking pattern ranges over 19 booking intervals; the first observation occurs 91 days before departure.

We first apply the correlation-based clustering approach of Sect. 2.1, using a threshold of 0.5, such that only legs with a minimum correlation of 0.5 can be in the same cluster. In Fig. 12a, coloured bubbles indicate the four resulting clusters: Each train line splits into one large and one small cluster.

Fig. 12

Comparison of correlation-based and rule-based clustering of Deutsche Bahn network

To evaluate clustering on empirical data, where the true underlying demand for each itinerary is unknown, we use the network topology to check whether the resulting clusters are plausible. To that end, we propose the following set of rules:

  • Different train lines must belong to different clusters. Even when passengers can transfer between lines, we expect relatively few passengers to make the same connection. Further, it makes sense to consider train lines separately for forecasting and analyst interventions.

  • Train lines are further split into separate clusters on either side of a major station. As many passengers leave the train at a major station and many different passengers board, we shall assume a relatively small proportion of passengers book itineraries that pass a major station. Similarly, given that itinerary demand share is driven by which journeys are most common, and passengers often either board or alight at a major station, it is intuitive to have a cluster that contains the legs between major stations.

Deutsche Bahn assigns an ordinal indicator of importance to each station, ranging from 1 to 7. We define a major station to be in Category 1. The entire Deutsche Bahn network includes 21 major stations, whereas the considered network section includes nine major stations. Figure 12b highlights major stations in grey and shows the clusters resulting from the above rules.

The correlation-based clustering returns four clusters, whereas the rule-based clustering returns nine. Nevertheless, the resulting clusters share similar features. Firstly, the two distinct train lines end up in different clusters in either approach. For legs in distinct train lines, correlation tends to be higher between legs that share a transfer station, but not to a convincing extent: the correlation is at most 0.22. A correlation threshold of 0.27 creates two clusters (one for each train line). Secondly, the breakpoints for the correlation-based approach are a subset of the breakpoints, i.e. major stations, in the rule-based approach. We conclude that the correlation-based approach achieves results similar to those of the rule-based approach without requiring expert input.

Fig. 13

Comparison of rule-based and correlation-based clustering in a two-line railway network

We can formally compare clustering results using the Normalised Mutual Information (NMI) (Amelio and Pizzuti 2015). The NMI is 1 if two clusterings are identical, and 0 if they are completely different.
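For reference, a minimal sketch of an NMI computation is given below. Several normalisations of mutual information exist; the sketch assumes normalisation by the arithmetic mean of the two entropies, which may differ from the variant used for the reported results. The function name and the example labels are hypothetical.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalised mutual information between two clusterings of the same
    legs, normalised by the arithmetic mean of the two entropies."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both clusterings must label the same non-empty set")
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())

    h_a, h_b = entropy(count_a), entropy(count_b)
    if h_a + h_b == 0:
        return 1.0  # both clusterings place every leg in a single cluster
    mutual_info = sum(c / n * math.log(c * n / (count_a[a] * count_b[b]))
                      for (a, b), c in joint.items())
    return 2.0 * mutual_info / (h_a + h_b)

# Cluster labels for four legs under two hypothetical clusterings.
same_up_to_relabelling = nmi([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

The measure is invariant to the naming of clusters, so two clusterings that differ only in their labels still yield a value of 1.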

Figure 13a shows the NMI between the correlation- and rule-based approaches while varying the threshold in the correlation-based approach from 0 to 1. This shows that both approaches achieve similar results, with an NMI reaching 0.899. The approaches are generally more similar at higher correlation thresholds (around 0.7) since the rule-based approach generally creates more clusters. Figure 13b compares the number of clusters of the two approaches—as the correlation threshold changes, the number of clusters ranges from 1 (everything in a single cluster) to 28 (each leg in its own cluster), demonstrating the flexibility of the correlation-based approach.

5.1.2 Large network subsection

We extend the empirical study to five train lines to further demonstrate the complexity that considering the network structure brings to clustering and outlier detection, and show the scalability of the approach. The five-line network consists of 40 stations with 63 legs. As shown in Fig. 14, there are often multiple train lines which cover the same leg or may travel in the opposite direction. As the larger size of the network makes visualisation more difficult, in Fig. 14, stations are represented by circles, with major stations highlighted in black.

Figure 14(a) shows the results of the correlation-based clustering with a default threshold \(\rho = 0.5\). This results in nine clusters, with two train lines each forming their own cluster containing all legs. The breakpoints of the clusters occur at major stations, as also previously seen for two train lines. The pattern of breaking clusters at major stations persists as the correlation threshold is varied. In comparison, the output of the rule-based clustering shown in Fig. 14(b) results in 24 clusters, with many being of size 1.

Fig. 14

Comparison of rule-based and correlation-based clustering in a five-line railway network

In these empirical studies, we applied rule-based clustering only to evaluate the plausibility of the results from correlation-based clustering. We do not advocate for it as a method in itself. A rule-based approach, where the clusters are based on domain experts’ categorisations, would not be able to respond to the evolving importance of stations across different train lines and departure times. Notably, the correlation-based method not only uncovers major stations but rather identifies legs where multi-leg itineraries cause similar booking patterns and thus could change and adapt over time. We further evaluate clustering performance in a simulation study, where the itinerary-level demand is known, in Appendix 8. The results in the remainder of the paper rely on correlation-based clustering.

5.2 Detecting outliers in the Deutsche Bahn data

Having established clusters, we apply outlier detection independently to each cluster. To exemplify this on empirical data, we apply the outlier detection procedure to a representative four-leg cluster from the Deutsche Bahn network. Applying the proposed outlier detection approach to empirical data cannot precisely judge detection accuracy, given there are no labelled data on genuine outliers. However, this analysis demonstrates the full process of outlier detection on empirical data including, e.g. seasonality and underlines practical implications.

For this analysis, we consider a cluster of four legs from the Deutsche Bahn network with stations anonymised and denoted by A, B, C, D, and E. This cluster results from applying the correlation-based clustering to a section of the Deutsche Bahn network different from that shown in Fig. 12.

Figure 15 shows the booking patterns for each of the four legs; bookings are scaled to be between 0 and 1. From initial visual inspection, the structure of the booking patterns appears similar, with some obvious outliers appearing across multiple legs.

Fig. 15 Booking patterns for each leg

To pre-process the data for outlier detection, we transform the booking patterns by applying a functional regression model (Ramsay and Silverman 1997). We then apply the outlier detection to the residual booking patterns. In this pre-processing, we correct for three factors: (i) the departure day of the week; (ii) the departure month of the year; and (iii) the length of the booking horizon (see Footnote 1).

The functional regression fits a mean function to the booking patterns for each different factor in the model. Table 9 in Appendix 13 compares models including different factors. Let \(y_{nl}(t)\) be the \(n^{th}\) booking pattern for leg l. Then:

$$\begin{aligned} y_{nl}(t) ={}& \beta _{0l}(t) + \beta _{1l}(t)\mathbbm {1}_{Mon_{nl}} + \beta _{2l}(t)\mathbbm {1}_{Tue_{nl}} + \beta _{3l}(t)\mathbbm {1}_{Wed_{nl}} \\ &+ \underbrace{\beta _{4l}(t)\mathbbm {1}_{Thu_{nl}} + \beta _{5l}(t)\mathbbm {1}_{Fri_{nl}} + \beta _{6l}(t)\mathbbm {1}_{Sat_{nl}}}_{\text {Departure Day of the Week}} \\ &+ \beta _{7l}(t)\mathbbm {1}_{Jan_{nl}} + \beta _{8l}(t)\mathbbm {1}_{Feb_{nl}} + \beta _{9l}(t)\mathbbm {1}_{Mar_{nl}} \\ &+ \beta _{10l}(t)\mathbbm {1}_{Apr_{nl}} + \beta _{11l}(t)\mathbbm {1}_{May_{nl}} + \beta _{12l}(t)\mathbbm {1}_{Jun_{nl}} + \beta _{13l}(t)\mathbbm {1}_{Jul_{nl}} \\ &+ \underbrace{\beta _{14l}(t)\mathbbm {1}_{Aug_{nl}} + \beta _{15l}(t)\mathbbm {1}_{Sep_{nl}} + \beta _{16l}(t)\mathbbm {1}_{Oct_{nl}} + \beta _{17l}(t)\mathbbm {1}_{Nov_{nl}}}_{\text {Departure Month of the Year}} \\ &+ \underbrace{\beta _{18l}(t)\mathbbm {1}_{{\rm Shorter\,\,Horizon}_{nl}}}_{\text {Length of Booking Horizon}} + e_{nl}(t), \end{aligned}$$

where, e.g. \(\mathbbm {1}_{Mon_{nl}} =1\) if departure n relates to a Monday, 0 otherwise. In this model, \(\beta _{0l}(t)\) represents the average bookings for Sunday departures in December, with a regular length of booking horizon, and \(\beta _{pl}(t)\) for \(p>0\) represent deviations from this mean pattern. The \(\beta _{pl}(t)\) are functions of time, which allows for relationships between factors to evolve over the booking horizon. Given that functional depths are calculated independently for each leg, we apply the regression model independently for each leg. The resulting residuals are included in Appendix 14, Fig. 38.
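To make the structure of this regression concrete, the sketch below fits the coefficient functions on a discretised booking horizon by ordinary least squares at each time point and returns the residual booking patterns. This pointwise fit is a simplified stand-in that omits any basis expansion or smoothing; the function names, the synthetic data, and the weekday and month encodings are assumptions made for illustration only.

```python
import numpy as np

def design_matrix(weekday, month, shorter_horizon):
    """Build the 19-column design matrix: intercept, six weekday dummies
    (Mon..Sat, Sunday as baseline), eleven month dummies (Jan..Nov,
    December as baseline), and one shorter-horizon dummy."""
    weekday = np.asarray(weekday)  # assumed coding: 0 = Monday, ..., 6 = Sunday
    month = np.asarray(month)      # assumed coding: 1 = January, ..., 12 = December
    cols = [np.ones(len(weekday))]
    cols += [(weekday == d).astype(float) for d in range(6)]
    cols += [(month == m).astype(float) for m in range(1, 12)]
    cols.append(np.asarray(shorter_horizon, dtype=float))
    return np.column_stack(cols)

def pointwise_functional_regression(Y, X):
    """Fit beta_p(t) separately at each grid point by least squares.
    Y: (N departures, T grid points); X: (N, P).
    Returns coefficients of shape (P, T) and residuals of shape (N, T)."""
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta, Y - X @ beta

# Synthetic booking patterns for one leg: a baseline curve plus a
# Monday uplift and noise (all values are made up).
rng = np.random.default_rng(0)
N, T = 400, 30
weekday = rng.integers(0, 7, N)
month = rng.integers(1, 13, N)
shorter = rng.integers(0, 2, N)
t = np.linspace(0.0, 1.0, T)
Y = t**2 + 0.2 * t * (weekday == 0)[:, None] + rng.normal(0, 0.01, (N, T))

X = design_matrix(weekday, month, shorter)
beta, residuals = pointwise_functional_regression(Y, X)
print(beta.shape, residuals.shape)
```

In this sketch, row 0 of `beta` plays the role of \(\beta _{0l}(t)\), row 1 the Monday deviation \(\beta _{1l}(t)\), and so on. The fit would be repeated for each leg, with the residual patterns passed on to the depth calculation.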

Fig. 16 Threshold exceedances per leg, \(z_{nl}\)

Functional regression preserves the correlation between different legs, as verified in Appendix 19, Table 11b. The clustering approach can consider either the correlations between the booking patterns or the residual booking patterns. Given that the functional depth (the basis for the outlier detection) is calculated on the residuals, we suggest using the correlation between residual patterns to define the clusters. For this data set, the same clusters resulted in either case.

We calculate the functional depth of each booking pattern and compute the threshold as described in Sect. 2.2. We then transform the depths as per equation (2) to obtain \(z_{nl}\), as shown in Fig. 16. The sums of threshold exceedances, \(z_n\), were shown earlier in Fig. 2, with the empirical distribution and fitted generalised Pareto distribution shown in Figs. 3a and 3b, respectively.
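The aggregation step can be sketched as follows. Equation (2) is not reproduced in this section, so the transformation below (the shortfall of a depth below its leg threshold, normalised by that threshold and floored at zero) is an assumed stand-in, and the depth values, thresholds, and function name are hypothetical.

```python
import numpy as np

def threshold_exceedances(depths, thresholds):
    """depths: (N departures, L legs); thresholds: (L,).
    Returns z_nl >= 0, positive only where a depth falls below the
    threshold of its leg.
    ASSUMPTION: the normalised shortfall stands in for equation (2)."""
    depths = np.asarray(depths, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    return np.maximum(thresholds - depths, 0.0) / thresholds

# Made-up functional depths for four departures on a four-leg cluster.
depths = np.array([
    [0.40, 0.38, 0.41, 0.39],  # regular on every leg
    [0.05, 0.10, 0.30, 0.35],  # outlying on two legs
    [0.02, 0.03, 0.04, 0.05],  # outlying on all four legs
    [0.35, 0.18, 0.33, 0.36],  # marginally outlying on one leg
])
thresholds = np.array([0.20, 0.20, 0.20, 0.20])

z_nl = threshold_exceedances(depths, thresholds)
z_n = z_nl.sum(axis=1)                      # sum of exceedances per departure
flagged = np.flatnonzero(z_n > 0)           # departures with any exceedance
alert_list = flagged[np.argsort(-z_n[flagged])]
print(alert_list)
```

If the severity of an outlier is taken to be a monotone function of \(z_n\), such as the distribution function of the fitted generalised Pareto distribution, then ordering by \(z_n\) as done here yields the same ranking.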

Figure 17 highlights the outliers detected in each leg in pink while depicting outliers detected in other legs but not in that leg in blue. Regular patterns are grey.

Fig. 17 Outliers detected in booking patterns

Of the 40 outliers (11% of departures) detected across all legs, 23 outliers (almost 60%) could be attributed to known events or holidays. When considering only the top 10 outliers, the percentage rose to 70%. A further departure detected as an outlier had been previously flagged by Deutsche Bahn. The firm implemented a booking stop to control sales on that departure for multiple connected legs. Appendix 18 provides further details on the distribution of identified outliers across legs.

6 Conclusion and outlook

In this paper, we proposed a two-step method for (i) clustering legs in a mobility network that could benefit from joint outlier detection and (ii) detecting outlying demand within such clusters. Furthermore, the proposed method, FD+Agg, ranks identified outliers according to their severity, creating an alert list to aid analysts in prioritising demand forecast adjustments.

The simulation study demonstrated the robustness of the method in a range of outlier demand scenarios. It highlighted that aggregating the analysis across clustered legs improves both detection rate and precision. Further, the ranked alert list often correctly identified the most critical outliers. The advantages of the proposed approach became particularly clear when benchmarking its true-positive rate, distribution of outliers across ranks, and precision against those of a combination of principal component analysis and high-density regions (PCA+HDR) from Hyndman et al. (2016), and against those of the non-ranked, leg-based method proposed in Rennie et al. (2021).

Furthermore, we implemented a simulated revenue management system to measure the potential revenue benefits of identifying and adjusting for demand outliers in a network setting by applying forecast adjustments across a cluster of legs. This analysis showed that taking into account the similarity of the legs can improve revenue in most scenarios. In the less likely scenario where only one or two legs of a cluster are affected by outlier demand, risk-averse firms may prefer individual leg-level adjustments.

Finally, by applying the proposed approach to empirical booking data collected by Deutsche Bahn, we demonstrated its applicability and scalability to the type of data observed in practice. In particular, we used this analysis to showcase the expected cluster results and to demonstrate how to account for additional practical considerations, such as trend and seasonality. Note that once the clustering has been performed, the outlier detection can be performed in parallel within each cluster. Therefore, our methodology is scalable to a much larger data set, such as the entire Deutsche Bahn long-distance train network. Such an analysis is not included in this paper as, beyond giving excessive insight into confidential company data, the research insight to be gained from visualising even more complex network cut-outs is limited.

The remainder of this section discusses design choices taken in the research documented here, related limitations, and open research challenges.

Leg- versus itinerary-level data: Our proposed method aggregates and analyses booking patterns from legs instead of itineraries based on three considerations. First, when an extensive network features many possible itineraries, most individual itineraries receive only a small share of bookings, which challenges any data analysis; the study described in Appendix 11 evaluates such a case. Though the outlier detection may perform well if there is a sufficient number of bookings for a given itinerary, considering only such itineraries risks systematically ignoring outliers from smaller itineraries and feeder legs. Second, when offering many potential itineraries, providers rarely store all booking patterns per itinerary. For example, capacity-based RM, as described in Strauss et al. (2018), frequently considers leg booking patterns to ensure capacity availability on each leg of a requested itinerary. Accordingly, the methodology proposed here is compatible with capacity-based RM. Finally, even in the idealised case of having large volumes of stored itinerary-level data for every possible itinerary in the network, running outlier detection algorithms quickly becomes computationally infeasible, as the number of possible itineraries grows rapidly with the size of the network. Detecting outliers in clusters of legs, rather than in individual itineraries, overcomes all three challenges, as we have demonstrated in this paper. We note that the outlier detection methodology we propose could also be applied directly to itinerary data without performing clustering. However, we recommend this only for densely booked itineraries, as zero-inflated data can otherwise induce inferior results; Appendix 11 explores this further.

Constrained versus unconstrained bookings: Observed bookings are constrained by any revenue management controls that were in place at the time of booking, whereas revenue optimisation models rely on unconstrained demand forecasts (Talluri and Van Ryzin 2004, Chapter 9.4). To represent this practice, we analysed constrained bookings in this paper and analysed the effect of adjusting unconstrained forecasts in the computational study. In that vein, further research could also consider the impact of applying the analysis to constrained observations, as showcased here, versus applying it to unconstrained demand estimates, which are frequently used for demand forecasting.

Implications for decision support: Further research is needed to consider the practical aspects of outlier detection from the perspective of decision support. Outliers manifest as changes in arrival rate, price elasticity, or other variables that affect bookings. Outliers can be caused by stochasticity but also by changes in demand patterns as a result of external factors, such as specific events. Complemented by further analysis, successful outlier detection could have three potential uses for RM: 1) detecting outliers early within the booking horizon through online analysis as proposed in Rennie et al. (2021), allowing for rapid interventions; 2) removing any detected outliers from training data for demand forecasting to improve results on predicting reference demand curves; and 3) extending the forecast model to include specific events, where outliers can be attributed to such events. Outlier detection can also have broader benefits for operational planning in transportation networks, helping service providers to avoid overcrowding and delays. To realise such benefits, future research should particularly focus on effective ways to visualise outliers in networks and to communicate alert lists to planners. To further support analysts in their decision-making, additional measures could be included in the alert list. These might include the average fare in the affected cluster, the potential revenue loss if the outlier is not accounted for, or the outlier severity resulting from running the outlier detection procedure on revenue (instead of booking) patterns. An interesting avenue of further research would be to incorporate a feedback element whereby analysts mark outlier alerts as useful or not useful. A supervised learning approach, e.g. one-class classifiers, could then be combined with our proposed outlier detection routine to filter out false alerts. Analysts could additionally include feedback on the quality of the clustering approach.

Clustering methodology: Investigating alternative clustering approaches is of interest, especially where the clusters are likely to differ in structure from those in the rail industry, e.g. in the airline industry, where hub-and-spoke networks are more common than lines. Whilst this paper relied on clustering to improve outlier detection, we believe that the clustering approach is a useful contribution in and of itself. For example, clustering presents additional research avenues, such as its application to improving network-level forecasting; supporting the planning for future new stations; and evaluating how the transport network structure is changing over time or defining different travel zones. Finally, further research opportunities lie in considering how the success of network outlier detection depends on the network structure. This paper featured examples from transport, specifically railway networks. Other application areas of RM, such as hotels, where correlation is induced by bookings for multiple consecutive nights, feature sparser or structurally different service networks.