1 Introduction

Intelligent Transportation Systems (ITS) play a key role in making transport services more predictable, improving their safety and reducing their costs and emissions. These systems use data collected from a wide range of in-vehicle sensors and other context-specific sources (e.g. scheduled services, movement flows, traffic congestion, weather, etc.) to monitor, analyse and improve transport operations.

The role of ITS is particularly interesting in the case of air traffic management (ATM), a sector that has almost recovered pre-pandemic traffic levels after the severe impact of the air traffic restrictions caused by COVID-19 [1]. This return to normality also brings back old problems, such as flight delays, one of the most common issues in this sector. A recent EUROCONTROL report [2] shows that more than 35% of flights in Europe arrived at least 15 min late with respect to their scheduled arrival time, which has a huge impact in terms of costs, emissions, and passenger satisfaction. This report also identifies reactionary delays (those caused by previous flights) as the main cause of flight delays, thus stressing the importance of having accurate predictions of the time of arrival to plan ahead and minimize cascading effects. Other sources of delays include airline or airport operations and weather conditions, among others. These situations increase the complexity of ATM operations, especially around airports, where air traffic controllers must monitor and handle incoming and outgoing flights while making efficient use of the available resources and ensuring safety. To do this effectively, controllers must have a good understanding of the current situation and, when possible, predict the future conditions of the airspace.

Flight plans have been one of the main information sources for air traffic control. Any flight taking place in European airspace must file its flight plan, indicating the intended flight path (in the form of waypoints, i.e. relevant geographic locations that describe the flight path), as well as the scheduled departure and arrival times, and other relevant information for planning purposes [3]. Flight plans are communicated up to one week before the start of the flight, but may be modified or amended at any time, enabling tactical decisions to be made in terms of resource allocation, meeting the needs of all stakeholders (airlines, authorities and customers) and ensuring safety. In practice, flight plans provide only a rough description of the expected flight path based on forecast conditions that may change during the flight. However, in-flight updates are not frequent and are mainly due to significant changes in the flight plan, such as long delays at any point in the flight or diversions due to bad weather. This still leaves a high level of uncertainty for pilots and air traffic controllers, who often have to make decisions in the event of sudden changes. This has been reported as a potential risk factor, given the workload and the pressure the controllers are under [4].

Surveillance systems are also used to assist air traffic controllers by providing them with the position of the aircraft throughout the flight. ADS-B (Automatic Dependent Surveillance-Broadcast) [5] has progressively replaced secondary radars for this purpose, taking advantage of the aircraft’s capabilities to determine its position as well as other important flight parameters (altitude, speed, bearing, etc.), which are continuously emitted by the vehicle. ADS-B equipment is mandatory for aircraft operating commercial flights in the world’s major airspaces and plays a key role in introducing the concept of Trajectory-Based Operations (TBO) [6] in intelligent ATM systems. TBO go beyond decision making based on flight plans thanks to the notion of the 4D trajectory, which integrates time into the 3D (latitude, longitude, and altitude) flight path [7]. Trajectories are thus described in terms of position and time and are agreed upon by all involved stakeholders to allow for better allocation of airspace and airport resources. Therefore, 4D trajectories enable flight delays to be treated as deviations from the expected trajectory, in the same way as changes in horizontal position or flight level, helping to understand these deviations and improve the predictability of ATM operations.

This paper explores how 4D trajectories can be used to improve predictions of the estimated time of arrival (ETA). ETA is a major factor in ATM operations, because it determines when a flight will arrive at the destination airport, allowing for efficient resource allocation in the transit airspace and at the airport. In the context of this work, ETA is defined as the estimated time until a flying aircraft lands at the destination airport, that is, until touchdown. Therefore, our study does not include taxiing times at the origin and destination airports. Most current research predicts ETAs in the Terminal Manoeuvring Area (TMA) [8,9,10,11,12,13], as this is where some of the most critical ATM operations take place, but these predictions can be valuable at any point along the flight path, to cover as much airspace as possible. On the other hand, some studies [14] take an individual approach, looking at a single route or segmenting the traffic according to different criteria. However, as flights approach the airport, the traffic coming from different routes becomes more homogeneous when performing the approach manoeuvres defined for that airport. The behaviour of the aircraft may also be similar during the flight, especially during the cruise phase. In other words, learning from multiple routes should help to identify and model increasingly rich flight patterns, whereas focusing on single routes requires training and maintaining route-specific models, with the additional costs that this entails.

Our approach treats flight trajectories as time series (including the four dimensions mentioned above and other features from surveillance, flight plan and weather data) and trains a deep learning model to make ETA predictions for flights incoming to a given airport. In particular, we designed an architecture based on Long Short-Term Memory (LSTM) neural networks [15] to leverage long- and short-term dependencies within the flight that influence the accurate estimation of its time of arrival. This architecture combines surveillance data, flight plans and weather conditions at the destination airport to characterize the flight state and predict an accurate estimated time of arrival based on the actual conditions in which the flight is taking place. Unlike other proposals that define a single model for each pair of origin and destination airports, we apply a global approach to leverage the similarities between trajectories departing from different airports.

A comprehensive evaluation is conducted to analyse the performance of our proposal on the basis of the different parameters that characterise it and to compare its results against a selected baseline, which includes prominent comparable solutions from the state of the art. Our approach is evaluated at the Adolfo Suárez Madrid-Barajas airport, using incoming flights (from 40 different airports) during the first three quarters of 2022, reporting a mean absolute error (MAE) of 2.65 min and a root-mean-squared error (RMSE) of 4.30 min over the entire flight, for all of the routes considered in this study. These results demonstrate that LSTM is a viable approach to ETA prediction in ATM and can surpass other state-of-the-art techniques for this task, such as ensemble and boosting machine learning methods. This paper also demonstrates how a global model can outperform individual models that are specific to a single route. Our experiments show that including trajectories from different routes improves the robustness of the model for each of those routes, with generalized improvements along the whole route.

In summary, this paper makes three main contributions:

  • A novel approach to estimating the time of arrival at any point along the flight path, based on surveillance data and taking into account the weather conditions at the destination airport.

  • An effective LSTM-based architecture that leverages the similarities between different routes arriving at the destination airport to provide more accurate results than specialized, state-of-the-art individual models.

  • A case study of European international flights arriving at Adolfo Suárez Madrid-Barajas (Spain), an airport that has not yet been studied in the literature, despite its high traffic volume.

The rest of the paper is organized as follows. Section 2 gives a broad picture of the estimated time of arrival problem, and Sect. 3 provides the basic background to understand our approach. Section 4 describes the selected data that we use to make ETA predictions using the LSTM-based architecture presented in Sect. 5. Section 6 describes our case study and the process of generating the dataset used in our experiments, which are carefully presented and analysed in Sect. 7. Finally, Sect. 8 presents our main conclusions and outlines our lines of future work.

2 Related work

Accurate prediction of the estimated time of arrival (ETA) is crucial to reduce the costs and environmental impact of flights and, at the same time, to improve their safety, capacity and efficiency [14]. Moreover, an inaccurate prediction of the ETA of a flight can have a cascading effect which, in turn, may have an impact on the arrivals of other flights that will have to wait until the necessary resources are available for landing. However, due to all the nondeterministic events that can occur during a flight, providing accurate predictions of the ETA is a challenging task.

ETA prediction for commercial flights was originally addressed using deterministic methods [16], based on aircraft performance and physics simulations. The basic idea of these methods is to compute a reference flight trajectory and then calculate the time needed to fly it. Although effective, these methods are complex and expensive to develop, and their predictions are highly dependent on the considered simulation conditions (weather, traffic congestion, etc.), so predictions will be inaccurate if these conditions do not hold during the flight.

On the other hand, data-driven approaches have gained importance in recent years, due to the increased availability of air traffic-related data and hardware resources capable of running computationally intensive machine learning algorithms. These methods leverage historical data to learn hidden patterns and are better suited to adapt to unseen or rare circumstances, providing more accurate predictions in the uncertain conditions that govern ATM operations. Thus, data-driven approaches generalize better than fine-tuned deterministic ones and are therefore currently the most appropriate choice to address the problem at hand.

Most of the existing data-driven approaches [8,9,10,11,12,13, 17] build a prediction model for a particular arrival airport, enabling ETA predictions for all incoming flights, regardless of their departure airport. All of them report figures for short-term ETA predictions, when the aircraft is close to the destination airport in terms of distance (mostly between 25 and 100 nautical miles, or NM, from the airport) or time (between 5 and 60 min before landing), but do not report numbers for the early stages of the flight. In [18], a single model is proposed to predict all flights within a given area, but it is suggested that individual models for each destination airport (or groups of related airports) would be more effective. Ayhan et al. [14] follow this approach and build optimized models for each particular route (i.e. a pair of departure and arrival airports), enabling accurate long-term ETA predictions, at the price of a complex deployment that involves maintaining multiple models (one per route) at the destination airport.

A second aspect to consider is the data used to construct the prediction models. This decision must take into account the many factors that can affect the operation of a flight. Surveillance information (mainly ADS-B) is present in most of the proposals, allowing the description of enriched 4D trajectories, where time and position (latitude, longitude and altitude) are enhanced with other valuable features, such as speed or heading, to characterize the aircraft movement over time. Weather data (wind direction and speed, visibility, etc.) are also present in most of the solutions, but the solutions differ in whether they consider this information only at the arrival airport [8, 11, 12, 17, 18], also at the departure airport [19] or throughout the flight [14]. Flight plan data, reporting origin and destination airports, the (scheduled and actual) off-block and takeoff times (if the flight has already started), the scheduled arrival time or the expected total duration of the flight, among other features, are also commonly used [10, 11, 13, 14, 17,18,19]. Finally, seasonality information [8, 12, 17, 19] (e.g. the day of the week, the month or the time of the day in which the flight will occur), congestion information [12, 14, 18, 19] (traffic density metrics at the airport or airspace level) or information about resource management [8, 11] (e.g. configuration of runways) have proven to be useful for ETA prediction.

The main difference between the existing state-of-the-art solutions lies in the method used to predict ETAs. Various machine learning methods have been evaluated for this purpose, most notably bagging ensemble models, such as random forests (RF) [20] or Extra-Trees (ET) [21], and boosting methods, such as Gradient Boosting Machines (GBM) [22] or Adaptive Boosting (AB) [23]. More recently, Feed-Forward Neural Networks (FFNN) [24] and other deep learning models, such as Long Short-Term Memory (LSTM) networks [15], have been used with varying degrees of success.

Glina et al. [8] propose an RF-based model (called Quantile Regression Forest), which provides short-term predictions (between 3 and 60 NM) with RMSE values between 0.33 and 1.25 min, for flights arriving at Dallas/Fort Worth International Airport (ICAO code, KDFW). Kern et al. [18] also use random forests to predict ETAs for domestic routes in the USA and report an MAE reduction of 42.7% compared to the ETA prediction provided by the Federal Aviation Administration (FAA) system. Kim [17] uses linear and median regression and a nonparametric additive model to predict ETAs for incoming domestic flights at the Denver International Airport (KDEN). In this case, the models were trained on 2010 data and then predictions were made for flights in 2011, reporting a mean absolute deviation of 8.63 min and concluding that departure delays are the most important factor for improving predictions. Dhief et al. [11] compare RF, ET and GBM models at Changi Airport (WSSS), concluding that ET performs better than the other methods and reporting an RMSE of 1.92 min at 100 NM from the destination airport.

Subsequent publications demonstrate the superiority of GBM over bagging methods, at different time horizons. In [13], a GBM-based model is used to predict ETAs for incoming flights to the Malpensa-Milan Airport (LIMC), reporting RMSE values of 175 s and 304 s, at 20 and 60 min from the arrival airport, respectively. Chen et al. [12] compare GBM, RF and FFNN predictions at different distances from the Zurich airport (LSZH), concluding that GBM performed slightly better than RF and reporting RMSE values of 3.16 and 4.75 min, at distances of 45 and 250 NM, respectively. An ensemble “stacked” model is proposed in [10], reporting slightly better figures than GBM (and other methods) for ETA prediction at the entry point of the Terminal Manoeuvring Area (TMA) of the Beijing Capital International Airport (ZBAA). Achenbach et al. [19] also propose an ensemble model, which combines GBM and linear regression. This ensemble performs better than its constituent models, reporting effective long-term predictions (RMSE of 5.9 min at departure) for flights flown by the A320 European fleet arriving at two airports. Ayhan et al. [14] make an exhaustive comparison of machine learning methods for ETA prediction on 10 major flight routes in Spain, including LSTM for the first time. In this case, GBM and AB report the best numbers, within 4 min of RMSE on average, regardless of the flight length. It is worth noting that LSTM reports unstable numbers in this evaluation, providing the most accurate prediction for one route while reporting twice as much error as AB on another. Recently, Ma et al. [9] proposed a spatio-temporal neural network model that reports numbers comparable to an LSTM-based approach, for incoming flights to the ZBAA airport.

Table 1 Summary of data-driven approaches for ETA prediction

Table 1 summarizes the main features of the reviewed approaches: the scope of their models and the machine learning method used (RF: random forests, LR: linear regression, GBM: gradient-boosting machines, AB: adaptive boosting, ET: extra trees, and LSTM: long short-term memory neural networks); the data used to build these models (Su: surveillance, W: weather, FP: flight plans, Se: seasonality, C: congestion and R: resources); and the flight points where the ETA is predicted. It is worth noting that the last row also describes our proposal in the same terms, for comparison purposes.

3 Background

Estimating the time of arrival based on 4D-trajectory data can be intuitively approached as a sequence modelling problem. Each trajectory is described by multiple sequences of values, where each value depends on the previous values in the sequence. Moreover, these data points have time information associated with them, which allows us to interpret the task as a time series problem. As such, the analysis of these data using deep learning can be tackled with different types of Recurrent Neural Networks (RNN). In this section, we succinctly describe how RNNs work, focusing later on the Long Short-Term Memory (LSTM) architecture, which is used in this paper.

Fig. 1 LSTM unit internal structure

3.1 Recurrent neural networks

In recent years, methods based on Recurrent Neural Networks (RNN) have shown good performance in time series modelling tasks, thanks to their ability to capture temporal dependencies in sequential data [25]. In contrast with traditional feed-forward neural networks, in which each layer passes information only to the next layer, recurrent layers contain cycles, that is, links between the neurons in the layer. This enables RNNs to have “memory” of the elements that have already been processed. A simple recurrent layer contains a single recurrent neuron that expects a sequence of elements as input. Recurrent neurons contain a cell state that changes after processing each input element and is used to process the next one. The layer iterates over every element in the sequence: at each step, the cell state is propagated from the previous iteration and used to process the next element in the sequence. Thus, the layer has “memory” of the elements processed earlier in the input sequence. When all elements in the sequence have been processed, the final output is passed to the next layer.
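The recurrence described above can be sketched in a few lines. This is a minimal, illustrative single-neuron recurrent layer with hypothetical scalar weights (not the architecture used in this paper); it only shows how the state computed at each step is fed back in when processing the next element.

```python
import math

def simple_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Minimal single-neuron recurrent layer: the state computed at each
    step is fed back in when processing the next element."""
    h = 0.0  # initial cell state
    for x in sequence:
        # the new state depends on the current input AND the previous state
        h = math.tanh(w_x * x + w_h * h + b)
    return h  # final output, passed on to the next layer

# The same trailing element yields different outputs depending on what
# came before, because the "memory" carried in h differs:
print(simple_rnn([1.0, 0.0]))
print(simple_rnn([0.0, 0.0]))
```

Note that the final output of `simple_rnn([1.0, 0.0])` is nonzero even though the last input is zero: the effect of the first element persists through the cell state.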

This structure causes RNNs to form very deep computation graphs that increase the risk of the vanishing gradient problem [25], particularly when analysing sequences with long-distance dependencies. The vanishing gradient problem consists of the error becoming too small or zero during the back-propagation step of model training. If errors are zero, the parameters of the model are not updated (their values do not change); this means that part of the model does not “learn” correctly, or takes a long time to learn long-term dependencies in long sequences. This problem can be handled using different techniques, among which Long Short-Term Memory networks in particular have attracted much attention in recent years.

3.2 Long short-term memory

Long Short-Term Memory (LSTM) [15] is a gated recurrent network architecture that ensures error propagation even in deep recurrent layers, allowing the model to have “long-term” memory without the loss of “short-term” memory shown by traditional RNNs. Inside an LSTM cell, three multiplicative units act as gates with different purposes: the forget gate, the input gate and the output gate (see Fig. 1). The forget gate determines the extent to which the output of the previous iteration is used to process the next input element. The input gate controls how much information from the input element will contribute to the hidden state. This gated unit protects the hidden state from perturbations and irrelevant elements in the input sequence. The output gate outputs the most relevant parts of the hidden state, once it has been updated. This helps to filter the information passed on to the next iteration, avoiding the propagation of irrelevant information from the current hidden state. When the last element in the sequence has been processed, this output is passed to the next layer in the neural network. During this process, the hidden state and the output are updated separately, which helps to ensure long-term memory.
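The gate mechanics can be made concrete with a scalar sketch. This is an illustrative toy with made-up weights, not the model from this paper; real LSTM layers use weight matrices, but the gating logic per step is the same.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One scalar LSTM step following the gate description above.
    `w` maps each gate to (input weight, recurrent weight, bias)."""
    f = sigmoid(w['f'][0] * x + w['f'][1] * h_prev + w['f'][2])   # forget gate
    i = sigmoid(w['i'][0] * x + w['i'][1] * h_prev + w['i'][2])   # input gate
    g = math.tanh(w['g'][0] * x + w['g'][1] * h_prev + w['g'][2]) # candidate state
    o = sigmoid(w['o'][0] * x + w['o'][1] * h_prev + w['o'][2])   # output gate
    c = f * c_prev + i * g   # cell state: keep part of the old, add part of the new
    h = o * math.tanh(c)     # output: a filtered view of the cell state
    return h, c

# Hypothetical shared weights, purely for illustration.
w = {k: (0.5, 0.5, 0.0) for k in 'figo'}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, w)
```

The additive update of `c` (rather than repeatedly squashing it through a nonlinearity, as in a plain RNN) is what lets gradients survive over long sequences.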

LSTM networks have already proven effective in dealing with air traffic data in different problems in the ATM field. In [26], LSTMs were combined with convolutional neural networks to predict the aircraft type using ADS-B data. Different aircraft types present particular patterns in their displacement, which can be identified in sequences of surveillance data. LSTM can analyse the progression of the latitude and longitude values, among other features, to classify the flight according to the aircraft type. Beyond ETA prediction, other regression problems have also been tackled with LSTM. Shi et al. [27] proposed an architecture for trajectory prediction based on ADS-B data. Their results showed that the last reported positions of an aircraft, among other features of interest for this task, allow accurate prediction of its future positions.

Gated Recurrent Units, or GRU [28], reduced the number of gates in LSTMs by merging forget and input gates into a single “update gate”. This simplification often leads to small improvements in training times since the architecture has fewer parameters to train. However, the performance difference between GRU and LSTM units depends mostly on the problem.

4 Data description

This section describes the data used in our approach, informed by the experiences reported in the related work. We organize the corresponding features into three main groups: surveillance, flight plans and weather, which are summarized in Table 2.

Table 2 Description of the features extracted from data to train the proposed models

4.1 Surveillance data

Surveillance data are used to describe the flight status over time. Each data point has an associated timestamp (assigned by the ground receiver at reception time) and contains information about the 3D position of the aircraft (longitude, latitude and altitude), its instantaneous speed (both horizontal and vertical) and the direction the aircraft is heading (track). Information about the airline that operates the flight is extracted from the callsign (the ID code of the flight). Additionally, we calculate a distance feature: the Haversine distance of the aircraft from the destination airport (LEMD).
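The distance feature can be computed with the standard Haversine formula. The sketch below uses approximate, illustrative coordinates for LEMD and a mean Earth radius; the exact reference point used in the paper is not specified here.

```python
import math

# Approximate LEMD (Madrid-Barajas) reference coordinates, for illustration.
LEMD_LAT, LEMD_LON = 40.47, -3.56
EARTH_RADIUS_KM = 6371.0

def haversine_km(lat, lon, ref_lat=LEMD_LAT, ref_lon=LEMD_LON):
    """Great-circle distance between an aircraft position and the airport."""
    phi1, phi2 = math.radians(lat), math.radians(ref_lat)
    dphi = math.radians(ref_lat - lat)
    dlmb = math.radians(ref_lon - lon)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# An aircraft over Barcelona (~41.30 N, 2.08 E) is a few hundred km from LEMD.
print(haversine_km(41.30, 2.08))
```

In practice this feature is recomputed for every state vector, giving the model an explicit, monotone-ish signal of progress towards the destination.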

Fig. 2 ADS-B messages are broadcast from the aircraft and captured by receivers on the ground. Overlapping coverage areas of different receivers can result in data duplication, while lack of coverage can result in data losses

We use surveillance data provided by OpenSky [29], an open, community-based network of receivers with great coverage of the European airspace. ADS-B messages are broadcast from the aircraft and received by ground stations, as shown in Fig. 2. Ideally, a broadcast message should be received by a specific ground station without incident, but two main problems can arise: (i) data duplication, when the same message is received by more than one ground station, and (ii) data loss, when ADS-B messages are not received due to lack of coverage in a particular region. OpenSky post-processes ADS-B messages to deal with some of these issues and converts them into state vectors [5], which preserve the most important surveillance information (identification, position and speed) of the aircraft, assigning the corresponding flight callsign to each. However, there are still irregularities in the resulting data. To further reduce them and improve data quality, we perform additional data processing tasks, which are described in Sect. 6.1.

4.2 Flight plan data

Flight plans provide us with scheduling data such as the expected times of departure and arrival, the actual times of departure and arrival (which are calculated after the end of the flight) or the departure airport. We use this information to calculate the departure delay, obtained as the difference between the scheduled and the actual takeoff times, and two seasonality features, to exploit daily and weekly time patterns: (i) the day of the week on which the flight is scheduled to end; and (ii) the time of day [19], which describes the hour range in which the aircraft is scheduled to arrive; based on our observations, we consider three periods: morning (7–13 h), evening (13–20 h) and night (20–7 h).
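These derived features are straightforward to compute. The sketch below is a minimal, assumed implementation (field names are hypothetical) that follows the bucket boundaries stated above.

```python
from datetime import datetime

def flight_plan_features(scheduled_takeoff, actual_takeoff, scheduled_arrival):
    """Derive the departure delay and the two seasonality features described
    above. Time-of-day buckets: morning 7-13 h, evening 13-20 h, night 20-7 h."""
    delay_s = (actual_takeoff - scheduled_takeoff).total_seconds()
    day_of_week = scheduled_arrival.weekday()  # 0 = Monday ... 6 = Sunday
    hour = scheduled_arrival.hour
    if 7 <= hour < 13:
        time_of_day = 'morning'
    elif 13 <= hour < 20:
        time_of_day = 'evening'
    else:
        time_of_day = 'night'
    return {'departure_delay_s': delay_s,
            'day_of_week': day_of_week,
            'time_of_day': time_of_day}

feats = flight_plan_features(
    scheduled_takeoff=datetime(2022, 3, 4, 9, 0),
    actual_takeoff=datetime(2022, 3, 4, 9, 12),   # 12 min late
    scheduled_arrival=datetime(2022, 3, 4, 11, 30),
)
print(feats)  # {'departure_delay_s': 720.0, 'day_of_week': 4, 'time_of_day': 'morning'}
```

Categorical features such as `time_of_day` would still need encoding (e.g. one-hot) before being fed to the network.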

We use the EUROCONTROL Network Manager as the source of flight plans. On the one hand, it provides the Flight Plans feed, which publishes (i) plans for future flights, including pre-flight scheduling information (such as planned departure and arrival times), airline, origin and destination airports and estimated flight time, and (ii) modifications with respect to a previous version of a flight plan, for flights that have not yet departed. On the other hand, the Flight Data feed provides information during and after the flight, such as the actual departure and arrival times, or any significant changes with respect to the flight plan. We identify the last version of the flight plan (which contains the most up-to-date information) to extract the features explained above.

4.3 Weather data

Weather data are used to characterize the expected weather conditions at the destination airport for the flight’s estimated time of arrival. The most relevant features are those related to wind condition (direction and speed), because these factors determine the direction of approach the aircraft must take, and influence its speed and manoeuvres. Other selected features describe temperatures, visibility conditions and sky conditions.

In this case, we use forecast reports from weather stations located at the destination airport (TAF, or Terminal Aerodrome Forecast). These reports describe the weather conditions expected in the surroundings of the airport over a period of time (typically, for the next 24 h), which is accordingly subdivided into smaller time periods when changes in conditions are expected. Weather forecasts include data about wind (direction and speed), temperatures, precipitation, icing probability and visibility, many of which may influence landing and takeoff operations.

4.4 Target variable

We define a remaining time to arrival (RTA) value, which is used as the target variable for our model. RTA is the difference (in seconds) between the timestamp of each state vector and the actual landing time of the flight to which it belongs. We use surveillance information to obtain the landing time; although flight plans provide an end time value, it usually corresponds to the time at which the pilot was cleared to initiate the landing procedure, several minutes before the actual landing.
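The labelling step is simple once the landing time is fixed. A minimal sketch, assuming timestamps are available as `datetime` objects:

```python
from datetime import datetime

def rta_labels(state_vector_times, landing_time):
    """RTA for each state vector: seconds remaining until touchdown.
    The landing time comes from surveillance data, not the flight plan."""
    return [(landing_time - t).total_seconds() for t in state_vector_times]

landing = datetime(2022, 5, 1, 14, 0, 0)
times = [datetime(2022, 5, 1, 13, 50, 0),   # 10 min before touchdown
         datetime(2022, 5, 1, 13, 59, 30)]  # 30 s before touchdown
print(rta_labels(times, landing))  # [600.0, 30.0]
```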

5 Architecture

4D trajectories consist of long sequences (several hundred or even thousands) of flight points, which hide long- and short-term temporal dependencies within the multiple time series they comprise. As stated in Sect. 3, LSTM-based networks are able to learn from long sequences of data, such as 4D trajectories, capturing both short-term and long-term dependencies between the elements in the sequence. This fact motivates our decision to build an LSTM-based neural network to predict the estimated time of arrival of a flight, as shown in Fig. 3. This architecture consists of a single LSTM layer, with a hidden state of dimension n, and a fully connected (FC) layer with one cell and linear activation, which transforms the output of the LSTM layer (a vector of length n) into a scalar value: the predicted RTA value for the input sequence. Some considerations regarding the input sequences to LSTM networks need to be taken into account to make ETA predictions over 4D trajectories.
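The described architecture maps directly onto standard deep learning libraries. The following is a hedged PyTorch sketch (the paper does not prescribe a framework); `n_features` and `hidden_size` are placeholder values, not the ones used in our experiments.

```python
import torch
import torch.nn as nn

class ETAPredictor(nn.Module):
    """Single LSTM layer followed by a one-unit fully connected layer with
    linear activation, as in Fig. 3. Sizes here are illustrative only."""
    def __init__(self, n_features=10, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)  # scalar RTA prediction

    def forward(self, x):              # x: (batch, lookback, n_features)
        out, _ = self.lstm(x)          # out: (batch, lookback, hidden_size)
        return self.fc(out[:, -1, :])  # use the hidden state of the last timestep

model = ETAPredictor()
batch = torch.randn(4, 30, 10)  # 4 windows with lookback 30, 10 features each
print(model(batch).shape)       # one scalar prediction per window: (4, 1)
```

Only the hidden state after the final timestep is passed to the FC layer, matching the description of LSTM layers emitting their output once the whole sequence has been processed.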

Fig. 3 LSTM model architecture. The unrolled form (right) explicitly represents each timestep in the input sequence

Fixed-length sequences LSTM networks expect input sequences of fixed length, but 4D trajectories from different routes, or even trajectories within the same route, may have different lengths (depending on the travel distance and the number of available surveillance state vectors), so trajectory data need to be transformed into a suitable form. We use a sliding window of length lookback (lb) to ensure fixed-length sequences: for a trajectory with p state vectors, \(({\textbf {x}}_1, {\textbf {x}}_2, ..., {\textbf {x}}_p)\), we generate \(p-lb\) sub-sequences of length lb (as illustrated in Fig. 4) and label each one with the RTA value corresponding to the last vector in that window. Thus, lb is a determining parameter of our model: the longer the input sequence, the more clues the model has to make a prediction, but longer sequences also increase the complexity and duration of the training process.
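The sliding-window extraction can be sketched as follows, with toy data standing in for real state vectors; it follows the text in generating \(p-lb\) windows and labelling each with the RTA of its last vector.

```python
def sliding_windows(trajectory, rta, lb):
    """Split one variable-length trajectory into p - lb fixed-length
    sub-sequences, each labelled with the RTA of its last state vector."""
    p = len(trajectory)
    windows, labels = [], []
    for i in range(p - lb):
        windows.append(trajectory[i:i + lb])
        labels.append(rta[i + lb - 1])  # RTA of the last vector in the window
    return windows, labels

traj = list(range(6))                 # toy trajectory with p = 6 state vectors
rta = [500, 400, 300, 200, 100, 0]    # seconds to touchdown for each vector
w, y = sliding_windows(traj, rta, lb=3)
print(w)  # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
print(y)  # [300, 200, 100]
```

Each (window, label) pair becomes one training example, so a single flight contributes many examples spread along its whole path.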

Fig. 4 Extraction of sub-sequences using a sliding window

Sequences with regular periodicity LSTM networks are not explicitly designed to deal with incomplete or irregular time series data, so irregular patterns in the time dimension, due to missing elements or uneven element spacing, can affect the performance of the predictive model [30]. This is the case with ADS-B surveillance data, for two main reasons. On the one hand, it is not the aircraft that provides the timestamp when sending the ADS-B messages, but the ground receivers when receiving them; as a result, chronologically sorted surveillance data may not be in the same order in which they were sent. On the other hand, surveillance coverage is limited in some (mainly maritime) regions, so messages sent when the plane is flying over these areas are often lost. There are also other problems, such as ADS-B messages captured by multiple receivers, and therefore having different timestamps, or messages where the timing information is inconsistent for other reasons. Figure 2 illustrates these situations. Any ADS-B message broadcast outside the combined coverage area of the receiver network will be lost. Conversely, if there are two or more receivers in range, each receiver will capture the message and set its timestamp to its time of reception. If the reception times (\(t_1\) and \(t_2\) in the figure) are different, then the same ADS-B message is recorded twice with inconsistent time data.

All these situations are addressed to ensure that the input data are evenly distributed over time, with a regular time interval between adjacent elements. First, trajectories are downsampled [30] to ensure higher temporal uniformity. The resulting representation can be seen as a summarised trajectory, in which the generalisation of the discovered patterns is improved and potential noise is removed. However, sampling might discard valuable information if applied too aggressively. Second, sub-sequences containing gaps longer than a given time threshold are also removed, to minimize irregularities. In this paper, we set this threshold to a maximum of 3 min between adjacent state vectors.
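Both cleaning steps can be sketched with plain Python over (timestamp, features) pairs; the downsampling period below is a hypothetical value, while the 3-min gap threshold is the one stated above.

```python
def downsample(points, period_s):
    """Keep at most one state vector per `period_s` seconds.
    `points` is a chronologically sorted list of (timestamp_s, features)."""
    kept, last_t = [], None
    for t, feat in points:
        if last_t is None or t - last_t >= period_s:
            kept.append((t, feat))
            last_t = t
    return kept

def has_gap(window, max_gap_s=180):
    """Flag sub-sequences with more than 3 min between adjacent vectors,
    so they can be discarded."""
    times = [t for t, _ in window]
    return any(b - a > max_gap_s for a, b in zip(times, times[1:]))

pts = [(0, 'a'), (2, 'b'), (10, 'c'), (400, 'd')]
ds = downsample(pts, period_s=5)
print(ds)           # (2, 'b') dropped: too close to (0, 'a')
print(has_gap(ds))  # True: 390 s gap between t=10 and t=400
```

Downsampling trades temporal resolution for regularity; the gap filter then removes windows whose regularity cannot be recovered at all.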

6 Case study

In this work, we focus our case study on the Adolfo Suárez Madrid-Barajas airport (ICAO code LEMD), the leading Spanish airport in terms of passenger traffic and the fifth in Europe in 2022. LEMD has four physical runways, arranged as two pairs of parallel runways, that can be used for either takeoff or landing operations, depending on the current runway configuration. LEMD uses two different configurations: north (north-facing runways are used for takeoffs and south-east-facing runways for landings) and south (vice versa), chosen on the basis of weather conditions and available resources. A sample of 200 flights landing at LEMD in January 2022 is illustrated in Fig. 5, which shows the runway configuration prevailing at that time. The north configuration was the most frequently used in the first two months of 2022, but from March onwards the distribution between configurations became more even due to the change in prevailing weather conditions. This makes the estimated time of arrival at Madrid-Barajas even more uncertain, because incoming flights may need to execute different approach manoeuvres, depending on the airport’s current configuration, in order to land on the assigned runway. These manoeuvres may take several minutes and cannot be anticipated, since the landing runway is only assigned and communicated to the pilot when the aircraft is already close to the airport; it therefore cannot be used as input to the proposed model.

Fig. 5
figure 5

Sample of 200 flights arriving at LEMD (Jan, 2022)

6.1 Dataset generation

Our study covers incoming flights to LEMD in the first nine months of 2022, i.e. we collected data from the above-mentioned sources from 1 January to 30 September 2022. It is worth noting that there is a significant imbalance in the number of flights from each departure airport to LEMD, which could bias the model in favour of more frequent routes. To mitigate this, we select the 40 most frequent routes and limit each route to a maximum of 70 trajectories per month. Figure 6 shows the 40 selected airports on the map and indicates, using colours, the number of available trajectories for each one during the study period.

Fig. 6
figure 6

Airports considered in the study. Marker colour indicates the number of trajectories in our dataset

The acquired raw data collection needs to be transformed to ensure high-quality 4D trajectories. This process is performed in three stages, described as follows.

Trajectory reconstruction This first stage focuses on determining flight trajectories and enriching them with flight plan data. First, we search the Network Manager’s Flight Plan feed for flights arriving at LEMD from any of the 40 selected departure airports. Their identification (aircraft ICAO24 code and flight callsign) and time information (departure and arrival times) are then used to assign ADS-B vectors to individual flights and reconstruct the corresponding trajectory. Surveillance or flight plan data that cannot be joined are discarded at this point, and flights with fewer than 300 state vectors are also removed. Finally, each vector is enhanced with the latest available weather forecast (the most recent report published before the vector timestamp) valid for the scheduled arrival time.
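The join described above can be sketched as follows. The field names (`icao24`, `callsign`, `ts`, `dep_time`, `arr_time`) and the exact matching rule are illustrative assumptions, not the paper's implementation; only the grouping keys and the 300-vector threshold come from the text.

```python
from collections import defaultdict

MIN_VECTORS = 300  # flights with fewer state vectors are discarded


def reconstruct(flight_plans, state_vectors, min_vectors=MIN_VECTORS):
    """flight_plans: dicts with icao24, callsign, dep_time, arr_time.
    state_vectors: dicts with icao24, callsign, ts (plus ADS-B fields).
    Returns {(icao24, callsign): [vectors sorted by ts]} for flights
    whose surveillance and flight plan data could be joined."""
    by_key = defaultdict(list)
    for v in state_vectors:
        by_key[(v["icao24"], v["callsign"])].append(v)
    flights = {}
    for fp in flight_plans:
        key = (fp["icao24"], fp["callsign"])
        # keep only vectors emitted between departure and arrival times
        vs = [v for v in by_key.get(key, [])
              if fp["dep_time"] <= v["ts"] <= fp["arr_time"]]
        if len(vs) >= min_vectors:
            flights[key] = sorted(vs, key=lambda v: v["ts"])
    return flights
```

Unmatched surveillance vectors and flight plans simply never appear in the output, which mirrors the "discarded at this point" behaviour.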

Quality checking Once reconstructed, the trajectories are cleaned to remove the quality problems inherent in ADS-B: incorrect time information, duplicate data or incorrect field values (altitude, speed, GPS position, etc.). This stage includes various cleaning operations: elimination of vectors with unrealistic latitude, longitude, altitude or velocity values; sorting of the state vectors within the trajectory in case of vectors misplaced in the time sequence; and linear interpolation of timestamp and altitude values for reordered vectors. Then, RTA values are updated according to the new timestamps (i.e. the difference between each vector’s timestamp and the landing time).
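Two of these operations, plausibility filtering and RTA recomputation, can be sketched as follows. The plausibility bounds are illustrative values of our own; the paper does not list the exact thresholds it uses.

```python
def clean_and_recompute_rta(vectors, landing_time):
    """vectors: dicts with ts (s), lat, lon, alt (ft), vel (kt).
    Drops vectors with out-of-range fields, sorts by timestamp and
    recomputes RTA as the remaining time until landing."""
    # illustrative plausibility bounds, not the paper's exact values
    ok = [v for v in vectors
          if -90 <= v["lat"] <= 90 and -180 <= v["lon"] <= 180
          and 0 <= v["alt"] <= 50_000 and 0 <= v["vel"] <= 700]
    ok.sort(key=lambda v: v["ts"])        # fix misplaced vectors
    for v in ok:
        v["rta"] = landing_time - v["ts"]  # seconds until landing
    return ok
```

Recomputing the RTA after sorting (and after any timestamp interpolation) keeps the label consistent with the corrected time sequence.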

Trajectory selection Trajectories with multiple loops during a holding procedure are removed at this stage, as they present a different challenge, due to their complexity and unpredictability [31]. Holding procedures force an aircraft to wait in the air until it is given permission to land at the airport, and are characterized by their looping trajectory pattern near the airport. As stated before, these procedures cannot be predicted before the aircraft enters the TMA. Given the formulation of our problem, holdings create a time shift in the entire trajectory (holding manoeuvres can last several minutes), which leads to an inconsistent computation of the RTA for state vectors of similar nature. In total, 338 trajectories (1.61%) with multiple holding patterns were removed.

Finally, trajectories that exceed the monthly limit set for each departure airport are randomly discarded at this stage. The resulting dataset from the above process consists of 19,633,275 state vectors from 20,560 trajectories, describing flights from the 40 selected airports to LEMD, during the study period. The monthly distribution of trajectories among the airports is shown in Appendix A.


6.2 Adaptation to the model

Additional transformations must be applied to the resulting dataset to satisfy LSTM constraints under different experimental configurations. First, trajectories are downsampled to ensure a regular distribution of their state vectors over time. The state vectors of each trajectory are divided into buckets of SP seconds (sampling period) according to their timestamp: the first vector of each bucket (in chronological order) is kept, and the rest are discarded. This operation is depicted in the top half of Fig. 7 for the case SP = 15. We also remove all sub-sequences that contain a gap of more than 180 s between adjacent vectors. Then, the categorical features are transformed into real values using label encoding (i.e. replacing each categorical value with an integer), and all features are normalized into the [0,1] range according to Eq. 1, where v is the original value of feature f, and \(v^f_{\textrm{min}}\) and \(v^f_{\textrm{max}}\) are the minimum and maximum values of the distribution for that feature.

$$\begin{aligned} v' = \frac{v - v^f_{\textrm{min}}}{v^f_{\textrm{max}} - v^f_{\textrm{min}}} \end{aligned}$$
(1)
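Eq. 1 is standard min–max scaling. A minimal NumPy sketch, with the bounds fitted on the training split (a common precaution against leakage; the paper only says the bounds come from the feature's distribution):

```python
import numpy as np


def fit_minmax(train: np.ndarray):
    """Column-wise minimum and maximum, computed once per feature."""
    return train.min(axis=0), train.max(axis=0)


def minmax(x: np.ndarray, vmin, vmax) -> np.ndarray:
    """Eq. 1: v' = (v - v_min) / (v_max - v_min), applied feature-wise."""
    return (x - vmin) / (vmax - vmin)
```

Applied to a 2-feature training matrix `[[0, 10], [5, 20]]`, the minima map to 0 and the maxima to 1, and any intermediate value scales linearly in between.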

Finally, trajectories are transformed into fixed-length sequences of lb elements (lookback), as illustrated in Fig. 4. For each trajectory, we generate all possible subsequences of lb neighbouring vectors (using a sliding window of equal length) and assign them the RTA value of the last vector in the window.
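The sliding-window construction can be sketched as follows (function and argument names are ours): each window of `lookback` consecutive vectors is labelled with the RTA of its last vector.

```python
import numpy as np


def make_windows(features: np.ndarray, rta: np.ndarray, lookback: int):
    """Turn one trajectory (n_vectors x n_features) into overlapping
    windows of `lookback` consecutive vectors; each window is labelled
    with the RTA of its last vector."""
    X, y = [], []
    for end in range(lookback, len(features) + 1):
        X.append(features[end - lookback:end])
        y.append(rta[end - 1])
    return np.array(X), np.array(y)
```

A trajectory of n vectors thus yields n − lookback + 1 training examples, which is why very large lookback values fail on short routes.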

It is worth noting that the length of the “flight history” we use to make a prediction is determined by the sampling period and the lookback value. For example, with a sampling period of 60 s and a lookback of 32 vectors, the model has access to the last 32 min of flight time to make a prediction. If the lookback is increased to 64 (keeping 60 s of sampling), the “flight history” taken into account extends to the last 64 min of flight time. The same span is obtained by increasing the sampling period to 120 s while keeping a lookback of 32 vectors. However, these choices are not equivalent: the configuration with a lower lookback and higher sampling period (32 and 120, respectively) provides less detailed data, because it describes the same time period with half as many state vectors. Figure 7 shows an example with lookback = 5. If we sample every 15 s (SP:15), a single window represents the previous 75 s of flight time; if the sampling period is 30 s, each window accounts for 150 s of flight time. While the length of the window is the same in all cases (5 state vectors), the flight time covered and the level of detail in which the trajectory is described differ for each configuration.

Fig. 7
figure 7

Example of a downsampled trajectory where the sampling period is increased from \(\sim\)5 s in the original OpenSky data to 15 s. The values of sampling period and lookback (in the figure, lookback = 5) determine the span of flight time the model has access to when making a prediction

7 Experiments

This section describes the experimental process we have performed to evaluate our proposal and compare it with the state of the art. First, we describe the experimental setup and methodology that we have followed to assess the performance and generalizability of our model. Then, we present our main findings and discuss the results and their contribution to the state of the art.

7.1 Experimental setup

In the following, we provide a comprehensive description of the experimental setup used in our study. Note that all experiments are conducted on a 4-core Intel Core i5-1035G4 at 1.10 GHz with 16GB RAM. No GPU acceleration was used. The execution environment includes Python 3.9, TensorFlow 2.9.1 and Keras 2.11.0 for LSTM models, and Scikit-Learn 1.1.3 for GBM, AB and RF models.

Dataset The dataset produced by the generation process described in Sect. 6.1 is divided into the usual train, validation and test subsets (containing 72.25%, 12.75% and 15% of the trajectories, respectively). A randomized, stratified approach is applied by distributing trajectories in direct proportion to their monthly and route frequency. In this way, the trajectories are evenly distributed across the three subsets according to the distribution of the original data. Finally, data in each subset are adapted to the particular model configuration, according to the process described in Sect. 6.2.
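The 72.25/12.75/15% proportions can be obtained with two successive stratified 85/15 splits (0.85 × 0.85 = 0.7225 and 0.85 × 0.15 = 0.1275). A sketch using scikit-learn, with a hypothetical per-trajectory stratum label such as "month-route" (the exact splitting procedure is our assumption):

```python
from sklearn.model_selection import train_test_split


def split_trajectories(traj_ids, strata, seed=42):
    """Two successive stratified 85/15 splits yield the 72.25/12.75/15
    proportions used in the paper. `strata` holds one label per
    trajectory, e.g. "<month>-<route>"."""
    trainval, test, s_trainval, _ = train_test_split(
        traj_ids, strata, test_size=0.15, stratify=strata, random_state=seed)
    train, val = train_test_split(
        trainval, test_size=0.15, stratify=s_trainval, random_state=seed)
    return train, val, test
```

Splitting at the trajectory level (rather than the window level) prevents windows of the same flight from leaking across subsets.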

LSTM model We evaluate two parameters that have a direct impact on the model: lookback and units. On the one hand, lookback determines the number of individual elements (state vectors) the model processes in order to make a prediction. The longer the sequence, the more information the model has to characterize the evolution of the flight. However, longer sequences require more computing power or model complexity to learn long-term, complex patterns from the data. We chose lookback values of 32 and 64, because our preliminary experiments reported poor performance for lookback values of 4, 8 and 16, and values larger than 64 were discarded as it was not possible to generate windows for some of the shorter routes considered. On the other hand, the number of units determines the dimensions of the internal representation that the model constructs from the input data. The higher this value, the more complex the model and the greater the risk of overfitting. After some preliminary testing, values of 10, 20 and 30 units were chosen.

We assign fixed values to the other hyperparameters of the model: we set the hyperbolic tangent as the activation function, the batch size to 128 and the loss function to mean absolute error (MAE), and Adam [32] is used as optimizer. We also experimented with the ReLU activation function, but it caused unstable training due to exploding gradients. During training, early stopping was used as a regularization measure to avoid overfitting: models were trained for 30 epochs and the version with the lowest validation loss was selected for evaluation. These configurations were applied to data with sampling periods of 30 and 60 s.
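A minimal Keras sketch of this configuration follows. The paper fixes the activation, loss, optimizer, batch size and epoch count; the single-LSTM-layer architecture with a one-unit regression head is our assumption.

```python
import tensorflow as tf


def build_model(lookback: int, n_features: int, units: int) -> tf.keras.Model:
    """Single-layer LSTM regressor: tanh activation, MAE loss, Adam.
    The layer layout beyond these fixed hyperparameters is illustrative."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(lookback, n_features)),
        tf.keras.layers.LSTM(units, activation="tanh"),
        tf.keras.layers.Dense(1),  # predicted (normalized) RTA
    ])
    model.compile(optimizer="adam", loss="mae")
    return model


# keep the weights of the epoch with the lowest validation loss
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=30, batch_size=128, callbacks=[early_stop])
```

`restore_best_weights=True` implements the "version with the lowest validation loss" selection described above.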

Baseline We consider a baseline that includes the most prominent approaches to ETA prediction in the state of the art: Gradient Boosting Machines (GBM), Random Forest (RF) and Adaptive Boosting (AB). Similar to [14], all models were configured with the default hyperparameters from their implementation in the Scikit-learn package (version 1.1.3), but reducing the number of estimators from 100 to 50 in all models due to memory constraints. Preliminary tests showed that AB performed poorly compared to the other models because the decision trees used as the default base estimator were too shallow. We therefore replaced the base estimator of AB with the estimator implemented in RF, which yielded better results. None of these models is designed to work with time series, so we provide individual state vectors as inputs instead of the constructed windows.
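The baseline configuration can be sketched with scikit-learn as follows. The size of the inner forest used as AdaBoost's base estimator is our assumption; the paper only states that the default shallow tree was replaced by the RF estimator and that all outer ensembles use 50 estimators.

```python
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)


def build_baselines(seed=42):
    """Default scikit-learn hyperparameters, with n_estimators reduced
    to 50 and AdaBoost's base estimator swapped for a Random Forest
    (inner size of 10 trees is an illustrative choice)."""
    return {
        "GBM": GradientBoostingRegressor(n_estimators=50, random_state=seed),
        "RF": RandomForestRegressor(n_estimators=50, random_state=seed),
        "AB": AdaBoostRegressor(
            RandomForestRegressor(n_estimators=10, random_state=seed),
            n_estimators=50, random_state=seed),
    }
```

Each model is then fitted on individual state vectors (one row per vector), since none of these estimators consumes the windowed sequences used by the LSTM.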

As an aside, we also tested using GRUs instead of LSTM units, but found no significant differences in model performance, so we excluded them from our study.

Metrics The models are evaluated using MAE (mean absolute error) and RMSE (root-mean-squared error). MAE is the mean of the absolute differences between each objective RTA value, \(y_i\), and the predicted value, \(\hat{y}_i\), over all inputs i (Eq. 2). RMSE is the square root of the mean of the squared differences between each objective value and the predicted value across all examples (Eq. 3). Due to its linear nature, MAE weights each example equally, regardless of its error value. In RMSE, errors are squared, so larger errors are weighted more than smaller ones, and RMSE can thus be used as a measure of the variance of the error values. Both metrics are given in seconds to enable direct comparisons.

$$\begin{aligned} \text {MAE}= & {} \frac{1}{n}\sum _{i=1}^{n}|y_i-\hat{y}_i| \end{aligned}$$
(2)
$$\begin{aligned} \text {RMSE}= & {} \sqrt{\frac{1}{n}\sum _{i=1}^{n}(y_i-\hat{y}_i)^2} \end{aligned}$$
(3)
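Eqs. 2 and 3 translate directly into NumPy:

```python
import numpy as np


def mae(y, y_hat):
    """Eq. 2: mean of absolute differences."""
    return np.mean(np.abs(np.asarray(y) - np.asarray(y_hat)))


def rmse(y, y_hat):
    """Eq. 3: square root of the mean squared difference."""
    return np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))
```

For instance, with targets `[0, 0]` and predictions `[3, 4]`, MAE is 3.5 s while RMSE is √12.5 ≈ 3.54 s; the gap between the two grows as the error distribution gains outliers.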

In this paper, we use two types of metrics: (i) global metrics to indicate the mean error value across all sequences (examples) in the dataset, regardless of the point of the trajectory in which they are placed; and (ii) at-time and at-distance metrics, which are used to characterize the prediction error at particular points of the trajectory. These particular points can be selected by time (e.g. the error 60 min before landing) or by distance (e.g. the error at 100 NM from the destination airport).

Note that longer trajectories are subject to greater uncertainty, but “cutting” all trajectories at the same remaining time or at the same distance to the arrival airport allows for a fair comparison, regardless of the route they describe. To calculate these metrics, we evaluate the model on the last available sequence at the selected point of the trajectory, provided that it is close enough to the cutting point: we set maximum thresholds of 300 s (at-time) and 10 NM (at-distance) for the difference between the cutting point and the RTA value or distance of the last state vector in the sequence. Otherwise, the sequence is not taken into account in the evaluation, as it is not representative of the designated point in the trajectory.
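For the at-time metrics, selecting the evaluation sequence can be sketched as follows (the at-distance case is analogous, with the 10 NM threshold in place of the 300 s one); the function name and exact selection rule are our reading of the procedure:

```python
def sequence_at_time(windows_rta, cut_s, max_diff_s=300):
    """windows_rta: RTA label (seconds to landing) of each window of one
    trajectory, in chronological order (RTA decreases along the flight).
    Returns the index of the last window whose RTA is still >= cut_s,
    or None if that window is more than `max_diff_s` from the cut."""
    candidates = [i for i, r in enumerate(windows_rta) if r >= cut_s]
    if not candidates:
        return None
    i = candidates[-1]  # last available sequence before the cutting point
    return i if windows_rta[i] - cut_s <= max_diff_s else None
```

Trajectories whose closest window is farther than the threshold contribute nothing to that metric, which keeps each cut representative of the designated point.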

7.2 Results and analysis

Table 3 shows the global metric results for different model configurations. To make it easier to refer to individual configurations, we use the notation (SP:X,LB:Y,U:Z) to indicate the model with a sampling period of X seconds, a lookback of Y vectors, and Z units. Note that the third column (Time) refers to the length (in minutes) of the input sequences, obtained by multiplying the values of SP and LB. The results for the baseline methods are also reported at the bottom of the table, using data sampled at 30 s, as in our best-performing configuration.

Table 3 Global metrics
Fig. 8
figure 8

MAE and RMSE values of baseline models and best LSTM configuration

We first analyse the impact of downsampling by comparing configurations with the same flight time but different sampling periods. They report similar numbers for sequences describing either 16 or 32 min of flight time, but SP:30 outperforms SP:60 for longer sequences, with a reduction in MAE of between 0.13 and 4.36 s. The number of units has little effect on the reported error values, although more units imply higher training costs. Finally, the lookback value has the greatest impact on the result. By doubling the number of state vectors in the input sequences, the model has access to twice the flight time, which helps to better characterize the current state of the flight and provide a more accurate prediction. (SP:60,LB:32,U:30) reduces MAE by 50.39 s compared to (SP:60,LB:16,U:30), and this difference is consistent for the 10- and 20-unit models. The gap is even larger for SP:30, where (SP:30,LB:64,U:20) outperforms (SP:30,LB:32,U:20) by 60.82 s in terms of MAE. The smallest MAE (159.24 s) is reported by the configuration (SP:30,LB:64,U:20), although (SP:60,LB:32,U:30) also reports competitive error values, increasing MAE by \(\approx\)2 s (161.17 s). In all cases, the RMSE values for each configuration confirm these observations.

We promote the configuration (SP:30,LB:64,U:20) for comparison with the baseline models, which are trained on data with the same sampling period. As shown in Fig. 8, GBM and AB provide similar results, with a MAE of 220.45 and 222.41 s, respectively. However, GBM achieves a significantly lower RMSE value than AB, indicating that GBM error values have a lower variance and thus its predictions are more consistent. RF performs poorly on both metrics, confirming that boosting approaches are superior to bagging approaches such as RF, as noted in the related work. The LSTM model is superior to all the baselines, improving their best result in terms of MAE by 61.23 s. The study of the RMSE yields the same conclusions, with LSTM providing a result 65.09 s better than the best baseline method, GBM, which highlights the improved stability of LSTM over GBM.

Table 4 Metrics at-time for LSTM models (time = 32 min). Lowest values (in seconds) for each metric are bolded
Table 5 Metrics at-distance for LSTM models (time=32 min). Lowest values (in seconds) for each metric are bolded

To complete this analysis, Tables 4 and 5 show the values of the at-time and at-distance metrics for our two selected configurations. In accordance with the global metrics, (SP:30,LB:64,U:20) outperforms (SP:60,LB:32,U:30) on all at-time metrics up to 90 min before landing, although the differences are small, between 2 and 4 s. However, the SP:60 model yields better results at 120 and 150 min, with a MAE 8.71 and 18.20 s lower, respectively. These metrics measure the performance when most of the remaining flights are still at cruise level, although they exclude most short-range routes (e.g. those corresponding to domestic flights in Spain). SP:30 has a higher density of data around the destination airport than SP:60, which may bias the model towards the area surrounding the airport at the expense of more distant parts of the routes. Also, sampling reduces the amount of noise in the data and thus may help the model to generalize better at higher rates when there are no sudden changes in the trajectory. Consequently, SP:30 models may be better suited to sequences near the airport, where the more detailed representation of the trajectory can benefit the model, while SP:60 performs better on sequences farther away from the airport, where there is less variability in the trajectory. The same holds for the at-distance metrics, although the differences are generally negligible, in agreement with the global results shown in Table 3.

In all cases, our models significantly improve the results reported by GBM, AB, and RF, both in terms of at-time and at-distance metrics, with the exception of RMSE at 150 min, where GBM performs slightly better than LSTM models.

7.3 Generalization assessment

This section focuses on further evaluating the performance, quality and reusability of our approach in a more realistic scenario. For this purpose, we use trajectories that were flown at a later date than the trajectories used to train the models. These trajectories describe incoming flights to Madrid-Barajas (from the 40 selected departure airports) from 1 to 31 October 2022. The resulting dataset (obtained using the generation process described in Sect. 6.1) consists of 2,224 trajectories and 2,512,429 state vectors.

Table 6 Evaluation with future data

Table 6 reports global MAE and at-time metrics for this scenario, including our best model configurations and the baseline methods. The performance deteriorates for all models, as expected, but ours still lead the comparison. The LSTM SP:30 model reports a 25.7% higher global MAE, an increase of 41 s in absolute terms compared to the results shown in Table 3. The at-time metrics also increase, each at a different rate: at 15, 30, 60, 90 and 120 min, the MAE increases by 8%, 18%, 24%, 39.5% and 27.5%, respectively. This analysis also applies to the LSTM SP:60 model, since its figures are comparable. These results indicate that the model is more robust the closer the aircraft gets to the destination airport, most likely for two reasons. On the one hand, the manoeuvres around the airport are standardized, so there is less variability in how the flight should progress. On the other hand, there are far more data on how aircraft behave in the surroundings of the destination airport than in any other area of the airspace, so the model may have learned this section of the flights better. Having more data for each route should help the model reduce the gap in generalizability across time horizons.

LSTM remains the leading model by a wide margin, improving the MAE by 52 s over GBM, which is the most effective state-of-the-art model in this experiment. This comparison also applies for every at-time metric considered, demonstrating LSTM’s superiority.

7.4 Individual airport models comparison

We conducted several experiments to assess whether a global approach would be better than training specific models for each individual route, as stated in [18]. In particular, we trained several (SP:30,LB:64,U:20) models using data from each of these individual routes. Each model was trained and evaluated on the subset of the global dataset corresponding to the trajectories belonging to its particular route. Therefore, the global model and each individual model share exactly the same data about the corresponding route. All models were trained for 40 epochs under the same conditions used to train the (SP:30,LB:64,U:20) global model, including the partition of the data into train, validation and test subsets: the training data of each of these models were the data from the same route that were used to train the global model, and the same holds for the validation and test datasets.

The results of these experiments are shown in Appendix 1 for all departure airports, but we focus on five of them: Frankfurt Main International (EDDF), Germany; Manchester Airport (EGCC), UK; Amsterdam Airport Schiphol (EHAM), Netherlands; Aeroporto Internazionale Marco Polo di Venezia (LIPZ), Italy; and Istanbul Airport (LTFM), which are chosen to cover the main European routes arriving at LEMD airport. A sample of the corresponding trajectories is depicted in Fig. 9. Nevertheless, it is worth noting that similar conclusions can be drawn for routes that follow the same airways to fly to Madrid.

Fig. 9
figure 9

A sample of the trajectories from the five airports considered in the study of the individual models

Table 7 Individual route model vs. global model approach

Table 7 shows the results of evaluating the global and individual models on the test datasets corresponding to each of the selected routes. The global model performed consistently better on every metric for each of the routes. The bottom of the table reports the mean improvements observed on these five routes. The largest reductions in MAE are observed closer to the airport. EDDF and EGCC benefit the most from using the global model, achieving improvements of 86.7 and 83.9 s in global MAE. In particular, EDDF shows a great improvement at 15 min, with a reduction in MAE of 96 s. These routes have in common that the origin airport is located in the heart of the European airspace, and thus the synergy with trajectories from other routes, which is observed in the surroundings of the airport, extends along most of the trajectory. Other routes, such as LTFM, do not share as much of their path with the other trajectories used to train the global model, so they do not benefit as much as the rest of the individual routes considered, with a MAE improvement of 47.8 s. EHAM lies in a middle ground, still achieving a MAE reduction of almost a minute with the global model.

Provided that the global and individual models were trained with the same data for each route, it becomes clear that the global model takes advantage of the availability of data from other routes. The European airspace is structured as a route network, where flights converge on airways that roughly determine the route an aircraft must follow to reach its destination airport. Once on an airway, most flights behave similarly under similar conditions (aircraft model, weather conditions, etc.). This synergy becomes more apparent closer to the airport, given that flights follow standard procedures in the surroundings of the airport to approach the runway and perform the landing, so the model benefits from having data from more flights, even if they belong to different routes. Therefore, global models can learn a wider variety of patterns using data from different routes, thereby improving their performance on each route.

These results show that using a global model is better than maintaining multiple models, one per route (i.e. between an origin airport and a destination airport), for three main reasons: (i) the global model can predict the ETA at any point in the flight more accurately than any of the individual models that only model flights for a specific route; (ii) maintaining a single model is less costly than maintaining a large number of smaller models; and (iii) the global approach can improve predictions even on routes where there are fewer flights due to a lower flight frequency or less data availability.

7.5 Ablation test

Our approach combines data from different sources to account for different factors that influence the course of a flight (surveillance, flight plans and weather conditions). We have conducted an ablation test to determine how each factor affects the performance of our approach. In particular, we trained an LSTM model (SP:30,LB:64,U:20) on three different datasets: (i) a dataset containing only surveillance data, which is the core of 4D trajectories; (ii) a dataset enhancing surveillance with flight plan data; and (iii) a dataset combining surveillance and weather data. We replicated the training configuration used for LSTM models in previous experiments, changing only the features used by the model from the original dataset.

The results are shown in Table 8. Surveillance data are the backbone of the predictive power of the model, given that they provide detailed information about the trajectory itself, rather than factors influencing it. On the one hand, adding flight plan features (time of day, departure airport, day of week and departure delay) slightly lowers the MAE of the model. On the other hand, adding the weather features alone causes an increase in the error. However, flight plan and weather data in combination help the model refine the results obtained with surveillance data alone by 49.5 s.

Table 8 Results of the ablation test

7.6 Discussion

Table 9 Overview of the main results of the works reviewed in Sect. 2

Our previous experiments confirm LSTM models as a good choice for ETA prediction, outperforming other machine learning methods that have been successfully used for the same purpose. The comparison with [14] is particularly interesting, because the behaviour of the LSTMs in their evaluation was rather unstable and, in all cases, their accuracy was lower than that of methods such as AB or GBM, contrary to what happens with our proposal.

We will now take a closer look at our results in comparison with the main results of the state of the art, which are summarised in Table 9. It is worth noting that these results may not be directly comparable in quantitative terms, since each proposal analyses different case studies, but it is valuable to consider these numerical results in order to elaborate the following qualitative analysis.

One of the strengths of our proposal is its ability to provide good predictions in both the long and the short term (from a few minutes to several hours), which is not common in the state of the art, where existing proposals focus on one or the other. The approaches in [19] and [17] are the most similar to ours, given that they predict ETA during the whole trajectory using one global model for all traffic incoming to a destination airport. [19] reports MAE/RMSE values (on the test set) of 4.31 and 5.9 min, respectively, using an ensemble composed of a linear regressor and several GBM models. In comparison, our LSTM model with the best configuration (SP:30,LB:64,U:20) achieves a MAE/RMSE of 2.65/4.29 min. In [17], the approach based on a nonparametric additive model reached MAE and RMSE values of 8.63 and 12.2 min, respectively, far from those reported by LSTM. However, their model was trained on data from 2010 and tested on data from 2011, so a fairer comparison may be made with the results reported in Table 6; even there, LSTM achieves 3.33/5.35 min when applied to data outside the training time frame.

Nevertheless, the state of the art is mainly focused on short-term predictions. Muñoz et al. [13] report an RMSE value of 304 s at 60 min before landing, while our best configuration achieves 258 s for the same metric. Wang et al. [10] report 48 s of MAE at 25 NM from the destination airport, while we achieve 55 s for the same distance. It is worth noting that this approach trains specific models for each runway, which removes one of the main sources of uncertainty in the short term, and provides only very short-term predictions in the TMA of the destination airport. Strottmann Kern and Medeiros [18] also used Random Forest and reported a reduction of 42.7% in MAE with respect to the ETMF predictions (the ATM system used by the Federal Aviation Administration). However, they do not provide any result that enables a direct comparison with our study. Glina et al. [8] applied Random Forests to predict short-term ETA in the surroundings of the airport using 5 days of data. The authors indicate that, during those days, there was only one active configuration at the destination airport and good weather conditions, which may reduce the complexity of the problem. The model is evaluated in a short-term scenario, at 20 and 60 NM from the destination airport, with a MAE of 58 and 75 s, respectively. Our LSTM model achieves slightly better results at 20 NM and worse results at 60 NM, with 56 and 103 s, respectively. This difference may be due to the airport characteristics: at LEMD, the 60 NM radius falls inside the area where aircraft usually manoeuvre, and is therefore the most difficult part of the flight to predict.

Dhief et al. [11] report competitive numbers at 100 NM: MAE of 85–101 s and RMSE of 104–125 s, depending on the runway and the model, outperforming ours: 132 and 215 s of MAE and RMSE. The difference between both errors is particularly interesting in our case, because it denotes the presence of outliers (which are more heavily penalized by RMSE than by MAE). This is because we kept trajectories with single-loop holding procedures in our dataset (while in [11] they are removed). However, the exact impact of holdings on our results is yet to be determined and will be addressed as part of our future work. Finally, the GBM model presented in [12] reports 3.16 and 4.75 min of RMSE at 45 NM and 250 NM, while our model reports 2.45 and 4.03 min at the same distances.

Ma and Du [9] proposed a complex model that combines trajectory clustering, convolutional neural networks, LSTM and attention mechanisms to predict ETA in the TMA using surveillance, congestion and weather data. That is, their results are based on the elapsed time since the aircraft enters the TMA of the airport. As indicated in the paper, the average time between entering the TMA and landing is approximately 20 min, so their results (MAE of 89.39 s) are comparable to our 15 min at-time metric, which is slightly higher (93.80 s).

Finally, Ayhan et al. [14] created different models for a sample of individual routes between Spanish airports, and the results supported the authors’ claim that LSTM produced unstable predictions. The results of this work may not be directly comparable with our proposal, since they make the prediction before the aircraft takes off; however, we include the following observation. Our study shares only one route with [14]: the route between LECO (A Coruña Airport, Spain) and LEMD. They reported that AdaBoost performed best on this route, with RMSE values of 3.12 and 3.71 min for their AdaBoost and LSTM models, respectively. For reference, our global LSTM model yields an RMSE value of 2.61 min, which matches that of the individual model for this route. This route was not included in Sect. 7.4 for a deeper analysis because the short length of its trajectories did not allow the calculation of at-time metrics beyond 15 min.

8 Conclusions and future work

We have presented a novel approach to predict the estimated time of arrival using LSTM neural networks, with the aim of taking a further step towards the predictability of air traffic management operations. We have used surveillance, flight plan and weather data to describe incoming flights to the Adolfo Suárez Madrid-Barajas airport, an airport that, to the best of our knowledge, had never been addressed before, despite its considerable importance at both national and international levels. We have conducted an exhaustive evaluation of different model configurations to better exploit the features of the generated dataset, reporting competitive results. Our proposal achieved overall MAE and RMSE values of 2.5 and 4.25 min, respectively, outperforming a baseline consisting of leading models such as RF, GBM and AB, which have reported competitive results for ETA prediction. Moreover, we are able to provide accurate predictions throughout the entire flight, remaining competitive with solutions specifically designed for short- or long-term predictions.

Our future work will focus on two main lines of research. On the one hand, we intend to enhance our approach with more advanced deep learning techniques, such as attention mechanisms, to improve the detection of relevant information in the input sequences, or convolutional neural networks, to better exploit positional data, which is key for ETA prediction purposes. On the other hand, we plan to design a more ambitious case study to gain a deeper understanding of the unique challenges of ETA estimation, to identify new sources of uncertainty and to build a portable solution that can be used in different airspace domains.