1 Introduction

Deep learning (DL) has become a popular approach for developing data-driven prediction, classification, and anomaly detection solutions. Work on deep learning from trajectory data is spread out over many domains, including but not limited to computer science [3, 11, 14], geographic information science [13, 17, 24], intelligent transportation systems [51], maritime sciences [8, 63], and ecology [34]. Consequently, the resulting body of work covers many different use cases and trajectory dataset types. A data-centric way to categorize trajectory datasets is according to their level of detail [19] along the mobility data continuum ranging from detailed dense trajectories (quasi-continuous tracking data of individual movement) to sparse trajectories (such as check-in data of individuals), and finally, aggregated trajectories (crowd-level information, typically aggregated to edges/nodes in a mobility graph, to a grid, or to a set of points of interest) [2, 28, 67]. Even though the level of detail of the underlying trajectory-based training data is an important factor determining the potential capabilities as well as the scale or resolution of derived machine learning models, cursory scanning of paper titles and abstracts is usually insufficient to determine which type of trajectory data was used to train the presented deep learning models. To gain a better understanding of the state of the art, it is instead necessary to review the data and methods sections in detail. While many papers start with dense trajectories, most convert them into sparse trajectories [1, 8, 13, 24, 43, 62, 63] or even aggregate them to crowd-level [4, 55, 66]. Common approaches to turning dense trajectories into sparse trajectories include: converting them into a sequence of stop locations [1, 24] or a sequence of traversed regions (grid cells) [13, 43], or converting them to trajectory images [8, 62, 63].

The recent literature includes review papers for specific mobility use cases. For example, in their review of location encoding methods for GeoAI [39], the authors stress the analogy between NLP word-to-sentence relations and location-to-trajectory relations. This analogy has led to Word2Vec-inspired approaches that encode locations into location embeddings using, for example, Location2Vec [68], Place2Vec [61], or POI2Vec [15]. Another recent review [27] is dedicated to deep learning for traffic flow prediction models, which are primarily trained on aggregated trajectory data. Finally, Wang et al. [58] provide a survey of deep learning for the wider field of spatio-temporal data mining, including trajectories. However, to the best of our knowledge, there is no review paper that provides a comprehensive overview of the different neural network (NN) architectures used to learn from trajectory data, specifically for mobility use cases.

The goal of this work, therefore, is to provide a first overview of the current state of deep neural networks trained with trajectory data, structured by: 1. Use case category (travel time, crowd flow, and location predictions; location and trajectory classifications; as well as anomaly detection and synthetic data generation), 2. Neural network architecture (including, for example, CNN, RNN, LSTM, and GNN), and 3. Trajectory data granularity (dense, sparse, aggregated) and representation. For practical reasons, the in-depth qualitative assessment part of this paper does not cover all relevant works published in recent years exhaustively. However, we provide at least one paper for each use case and DL combination that we have identified. We specifically reviewed publications at recent events, including SIGSpatial 2022\(^{1}\) [13, 24, 38, 43, 53, 55, 60, 65], Sussex-Huawei Locomotion (SHL) Challenge 2021\(^{2}\) at the ACM international joint conference on pervasive and ubiquitous computing (UbiComp) [57], Traffic4cast challenge 2021\(^{3}\) at NeurIPS [36], and Big Movement Data Analytics workshop BMDA 2021\(^{4}\) at EDBT [4, 33, 54].

Even though we focus explicitly on deep learning, it is worth noting that deep learning may not always be the best approach [21] to address a particular challenge. For example, the SHL Challenge summary [57] reports that classic machine learning (ML) models outperform deep learning models on their three metrics: F1 score, train time, and test time. In Section 3 we therefore review the main reasons stated by authors to motivate the use of deep learning as well as the baselines and metrics used in evaluations.

This review does not attempt to compare the performance of different deep learning approaches. Even though there are some commonly used open datasets (such as the Porto taxi,\(^{5}\) the T-Drive taxi,\(^{6}\) the GeoLife,\(^{7}\) and the Gowalla check-in datasets\(^{8}\)), cross-paper comparisons outside of dedicated data challenges (such as the DEBS Grand Challenge 2018 [22], the SHL Challenge 2021 [57], and the Traffic4cast challenge 2021\(^{5}\)) are notoriously difficult. For example, “Despite the Porto dataset’s original use as a standardized benchmark for open competition, design choices in subsequent work make cross-paper comparison difficult. Firstly, different papers often augment the dataset with their metadata not present in the original release, which may give some models an advantage over others independent of architecture or training design.” [53]

The remainder of this paper is structured as follows: Section 2 presents the recent deep learning research structured by the main use cases, neural network architecture, and trajectory data granularity and representation. Section 3 summarizes and discusses the motivations provided for the use of DL over classic ML and the benchmarks and metrics used to evaluate their performance. Section 4 extends the time frame of our analysis to further analyze the trends in this research area. Finally, Section 5 summarizes the findings and presents our conclusions. Since the terminology used in different publications is not necessarily consistent due to the large range of domains working on trajectory data analysis, we define the most important terms and abbreviations in a glossary in the paper appendix.

2 Representing trajectory data for deep learning use cases

Trajectory datasets used in research and industry are highly heterogeneous since they are affected by numerous technical and design choices. For example, in Graser et al. [19], we distinguish 10 dimensions along which movement datasets may vary: 1. Spatial resolution (fine or coarse / small or large location errors), 2. Spatial dimensions (2D or 3D), 3. Temporal resolution (sparse or frequent / quasi-continuous), 4. Sampling interval (regular or irregular), 5. Representation (polylines or continuous curves), 6. Constraints (network-constrained or unconstrained / open space), 7. Collection models (Lagrangian or Eulerian / checkpoint-based), 8. Tracking system (cooperative or uncooperative), 9. Privacy (personal or impersonal), 10. Data size (small or big datasets / streams). It is therefore not hard to imagine that there can be no one-size-fits-all approach to modelling trajectory data for machine learning. Instead, it is necessary to pick the right model for the use case and data available.

The literature on deep learning from trajectory data can be roughly divided into eight use case categories, as shown in Table 1. The first three use cases, consisting of 1. Trajectory prediction (& imputation), 2. Arrival time prediction, and 3. (Sub)trajectory classification, mostly use dense / quasi-continuous trajectory data, while the following use cases, consisting of 4. Anomaly detection, 5. Next location (& destination) prediction, 6. Synthetic data generation, 7. Location classification, and 8. Traffic volume / crowd flow prediction, tend to use increasingly sparse / check-in data or aggregated trajectories of multiple movers.

Table 1 Overview of trajectory data representations used to train neural networks (dataset categories, such as trajectories of individual movers, are marked with icons in the original table)

To further illustrate the wide variety of different trajectory data representations encountered in these use cases, Fig. 1 provides an overview ordered by the identified trajectory representations from dense individual trajectories at the top to crowd-level / aggregated movement trajectory data at the bottom. In many instances, the borders between these categories are fuzzy since they reside on a spectrum. While use cases such as Trajectory prediction, Next location prediction and Traffic volume / crowd flow prediction are clearly clustered in certain sections along this spectrum, other use cases, such as Anomaly detection and Synthetic data generation appear all over since, for example, anomalies can be analyzed both on the level of individual trajectories as well as on the crowd level where analysts may be interested in detecting unusual crowd patterns.

More details on the trajectory datasets and the data engineering steps applied to the trajectory data before they are used as input to train the neural networks are summarized in Table 2. Some works (e.g. [17, 48]) use additional data sources (such as the road network and POIs) in combination with trajectory data to train their models. These additional data sources have not been included in our review in favor of clarity and conciseness.

The following subsections describe each use case in more detail and introduce the neural network designs used to address the use cases, as well as the trajectory data representations used to train these neural networks.

Fig. 1 Overview of trajectory data representations used to train neural networks

Table 2 Trajectory datasets and data engineering; public datasets are printed in bold

2.1 Trajectory prediction / imputation

This use case category covers the prediction of trajectories in artificial [6], urban [13], and maritime environments [5, 41, 54], as well as the imputation of trajectories [43]. Depending on the context, trajectory prediction is also called movement prediction [41, 54] or future location prediction (FLP) [54], which can lead to confusion with the next location / destination prediction use case. However, this use case differs from the next location / destination prediction use case (which we will discuss later) in that it aims to predict the future path of a mover, that is, along which path the mover will get to its next location.

To predict trajectory paths with high spatial detail, it is necessary to learn from high-resolution training data. Therefore, the training data for these models consists of rather high-density trajectories of individual movers, which may be resampled, generalized, or discretized, as shown in Fig. 1. For example, Capobianco et al. [5] resample each trajectory to a fixed sampling interval of 15 minutes and create fixed-length trajectories with 12 positions, equalling a fixed trajectory duration of three hours. Mehri et al. [41] generalize AIS trajectories using context-aware piece-wise linear segmentation before feeding them into an LSTM three vertices at a time. This enables their model to perform short-term trajectory predictions with high spatial detail. Tritsarolis et al. [54], on the other hand, represent trajectories by their composition of differences in space \(\Delta x\), \(\Delta y\), and time \(\Delta t\) for the input of their RNN-based models to predict future positions \(\Delta t+1\). They even take the problem one step further and try to predict the co-movement of multiple movers through evolving clusters.
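Two of the representations described above, resampling to a fixed interval with fixed-length windows and encoding a track as per-step differences, can be sketched as follows. This is a minimal illustration and not the cited authors' code: the function names, the use of linear interpolation, and the `(t, x, y)` tuple layout (time in seconds, planar coordinates) are assumptions made for the example.

```python
from bisect import bisect_right


def resample(points, interval_s):
    """Linearly interpolate time-sorted (t, x, y) points onto a regular time grid.

    Assumes strictly increasing timestamps.
    """
    times = [p[0] for p in points]
    out = []
    t = times[0]
    while t <= times[-1]:
        i = bisect_right(times, t) - 1  # last observation at or before t
        if i >= len(points) - 1:
            out.append(tuple(points[-1]))
        else:
            (t0, x0, y0), (t1, x1, y1) = points[i], points[i + 1]
            w = (t - t0) / (t1 - t0)
            out.append((t, x0 + w * (x1 - x0), y0 + w * (y1 - y0)))
        t += interval_s
    return out


def fixed_length_windows(points, length):
    """Split a track into non-overlapping windows of exactly `length` points."""
    return [points[i:i + length]
            for i in range(0, len(points) - length + 1, length)]


def to_deltas(points):
    """Represent a track by its per-step differences (dx, dy, dt)."""
    return [(x1 - x0, y1 - y0, t1 - t0)
            for (t0, x0, y0), (t1, x1, y1) in zip(points, points[1:])]
```

With `interval_s=900` and `length=12`, the first two helpers would produce 15-minute samples grouped into 12-position windows, matching the settings reported for [5]. The output of `to_deltas` corresponds to the \(\Delta x\), \(\Delta y\), \(\Delta t\) style of input described for [54].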

An example of more long-term trajectory predictions is presented in Fan et al. [13]. They discretize mobile phone GPS trajectories using the H3 hexagonal grid\(^{9}\) and use the grid cell sequences to train their GRU. The resulting model predicts cell sequences which are afterwards used to search for similar high-resolution trajectories, which are returned as the final trajectory prediction.
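The discretization step can be illustrated with a small sketch. To keep the example free of external dependencies, it uses a plain square grid as a stand-in for the H3 hexagonal grid used in [13]. The function name, the `cell_size` parameter, and the choice to collapse consecutive repeats are assumptions for illustration only.

```python
import math


def to_cell_sequence(points, cell_size):
    """Map (x, y) points to square grid cell ids, dropping consecutive repeats.

    A square grid stands in for H3 here; with H3, the cell id would be the
    hexagon index at a chosen resolution instead of a (column, row) pair.
    """
    cells = []
    for x, y in points:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        if not cells or cells[-1] != cell:
            cells.append(cell)
    return cells
```

The resulting cell sequence plays the role of a token sequence that a recurrent model such as a GRU can be trained on.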

Transformers are another approach gaining traction. For example, Carroll et al. [6] use synthetic discrete movement sequences in a minimalistic grid world environment to train their transformers to predict trajectories. Another work using transformers to impute trajectories is Musleh et al. [43]. They propose TrajBERT, a model trained using H3-discretized (tokenized) GPS tracks. They “down-sample the trajectories by dropping three-quarters of the points of each trajectory and then run TrajBERT to fill the gaps by imputing the missing points”.
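The quoted down-sampling setup can be sketched as follows. This is only one possible reading of "dropping three-quarters of the points": keeping every fourth token is an assumption, as are the function names and the `[MASK]` placeholder. The actual TrajBERT preprocessing may differ.

```python
def downsample(tokens, keep_every=4):
    """Keep every `keep_every`-th token, i.e. drop three-quarters when it is 4."""
    return tokens[::keep_every]


def masked_input(tokens, keep_every=4, mask_token="[MASK]"):
    """Replace the dropped positions with a mask token for the model to fill."""
    return [tok if i % keep_every == 0 else mask_token
            for i, tok in enumerate(tokens)]
```

An imputation model would then be evaluated on how well it recovers the original tokens at the masked positions.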

The difficulty of a specific trajectory prediction task is influenced by the type of movement (network-constrained or unconstrained), the complexity of the movement patterns observed in the area of interest [20], as well as the desired prediction horizon (how far into the future the prediction is attempted) and the quality of the available training data. Additional information, such as the vessel destination labels [5] can further improve the prediction accuracy. The evaluation metrics for trajectory prediction tasks usually include the RMSE, MAE, or MAPE of spatial and spatiotemporal distance measures (see Table 3).
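As an illustration of such metrics, the sketch below computes MAE and RMSE over great-circle distances between predicted and observed positions. It assumes positions are given as `(lat, lon)` pairs in degrees and uses a mean Earth radius of 6,371 km. The reviewed papers may use other distance measures, so the function names and choices here are illustrative only.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))


def displacement_errors(predicted, observed):
    """Per-point distance between predicted and observed (lat, lon) positions."""
    return [haversine_m(*p, *o) for p, o in zip(predicted, observed)]


def mae(errors):
    """Mean absolute error of a list of non-negative distance errors."""
    return sum(errors) / len(errors)


def rmse(errors):
    """Root mean squared error of a list of distance errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))
```

RMSE penalizes large individual deviations more strongly than MAE, which is why the two are often reported together.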

Table 3 ML and DL baselines used to compare proposed new methods, including papers that do not provide baseline comparisons (n/a)

2.2 Arrival time prediction

This use case category covers the prediction of travel times or arrival times, such as arrival time prediction in train networks [3] and street networks [11, 56]. This type of task is also known as the Estimated Time of Arrival (ETA) task. In classical intelligent transportation systems (ITS), ETA is commonly computed using routing algorithms leveraging street network data with travel time information. This travel time information may be historical averages for a certain season, day of the week, and time of day, or the result of more advanced ML/DL prediction models. Since future travel times often correlate with historical travel times (for example, at a given time of day and day of the week), recurrent mechanisms are commonly used to predict them [3, 11, 56]. For example, Derrow-Pinion et al. [11] train GNNs on aggregated trajectories to provide travel time predictions in Google Maps. The GNN graph consists of segment and supersegment-level embedding vectors. Nodes store street segment-level data (average real-time and historical segment travel speeds and times, segment length, and road class), while edges store supersegment-level data (real-time supersegment travel times).
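The historical-average approach mentioned above can be sketched as a simple lookup keyed by day of the week and hour of day. This is a hypothetical baseline for illustration, not a method from one of the reviewed papers: the function names and the `(departure, travel_time_s)` input format are assumptions, and seasonality is left out for brevity.

```python
from collections import defaultdict
from datetime import datetime


def fit_historical_means(trips):
    """Average travel time per (weekday, hour) from (departure, seconds) pairs."""
    sums = defaultdict(lambda: [0.0, 0])
    for departure, travel_time_s in trips:
        key = (departure.weekday(), departure.hour)
        sums[key][0] += travel_time_s
        sums[key][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


def predict(means, departure, default=None):
    """Look up the historical mean for the departure's weekday and hour."""
    return means.get((departure.weekday(), departure.hour), default)
```

A lookup of this kind is also a natural baseline against which ML and DL travel time models can be compared.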

In contrast, Wang et al. [56] introduce the GEO-convolutional network layer (GEO-Conv, also used by Buijse et al. [3]), which is trained on dense trajectories rather than aggregated trajectories. The proposed GEO-Conv layer takes dense trajectories as input and applies a non-linear mapping of each trajectory point, followed by a GEO-Conv step with multiple kernels.

The difficulty of a specific arrival time prediction task is influenced by the variability of the travel times in the area of interest, as well as the desired prediction horizon and the quality of the available training data (including sufficient data on relevant recurring patterns and events that may influence travel times). The evaluation metrics for arrival time prediction tasks usually include the RMSE, MAE and MAPE (see Table 3).

2.3 (Sub)trajectory classification

This use case category considers the classification of complete trajectories [63] or sub-trajectories [8, 57]. Typical applications include the detection of mover classes and movement types, such as ship maneuvers [8] or the detection of transportation and locomotion modes of smartphone users [57].

An example of the classification of complete trajectories is the recognition of ship types introduced by Yang et al. [63]. It relies on the same technique as [8] for transforming the raw AIS trajectory data into colour-coded trajectory images. The resulting images show characteristic trajectory patterns, which can be used to identify the ship type with a CNN classifier.

By splitting trajectories into sub-trajectories, more fine-grained analyses are possible. For example, Chen et al. [8] generate colour-coded trajectory images from ship AIS data, where each pixel is assigned one of three colors according to the movement type (static, normal, maneuvering). These trajectory images are used to train a CNN-based ship maneuver classifier.
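The idea of a colour-coded trajectory image can be sketched as a rasterization step. The sketch below assumes that each point already carries a movement-type label (here encoded as the hypothetical integers 1 = static, 2 = normal, 3 = maneuvering) and that a bounding box and image size are given. How [8] derives the labels and renders the images is not reproduced here.

```python
def rasterize(points, bounds, size, background=0):
    """Draw labelled (x, y, label) points into a size x size integer image.

    bounds is (xmin, ymin, xmax, ymax); points outside the bounds are skipped.
    Row 0 is the top of the image (largest y).
    """
    xmin, ymin, xmax, ymax = bounds
    image = [[background] * size for _ in range(size)]
    for x, y, label in points:
        if not (xmin <= x <= xmax and ymin <= y <= ymax):
            continue
        col = min(int((x - xmin) / (xmax - xmin) * size), size - 1)
        row = min(int((ymax - y) / (ymax - ymin) * size), size - 1)
        image[row][col] = label
    return image
```

Mapping the integer labels to three colours yields an image that a standard CNN classifier can consume.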

An example of human sub-trajectory classification is presented in the Sussex-Huawei Locomotion-Transportation (SHL) Recognition Challenge [57], which aims to identify different movement modes (i.e. still, walk, run, bike, car, bus, train, and subway) from smartphone data. The SHL Challenge winner uses AdaNet, a TensorFlow-based framework for learning NN models and ensembling models to obtain even better models [10].

The difficulty of a specific (sub)trajectory classification task depends on the nature of the classes of interest (some classes may be easy to distinguish while others may have large overlaps with neighboring classes) as well as the quality of the training dataset (in particular the class balance, which may be more or less skewed towards the popular classes and under-represent rarer classes). Additional complexity is introduced if, in addition to the classification, the splits between subtrajectories have to be determined as well. The evaluation metrics for (sub)trajectory classification tasks usually include F1 score, accuracy, precision and recall (see Table 3).
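For reference, the classification metrics named above can be computed per class as in the following sketch. The function names are illustrative, and the macro-averaged F1 shown here is one common way to account for skewed class balance; the reviewed papers do not necessarily use this exact averaging.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over the classes in y_true."""
    classes = sorted(set(y_true))
    return sum(precision_recall_f1(y_true, y_pred, c)[2] for c in classes) / len(classes)


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Because macro F1 weights every class equally, a model that ignores rare movement modes scores noticeably lower than its accuracy alone would suggest.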

2.4 Anomaly detection

This use case category deals with the identification and handling of unusual observations and patterns in the data, often referred to as outliers or anomalies. Both terms are often used interchangeably [7]. Outliers can be defined as observations that “deviate so much from other observations as to arouse suspicions that it was generated by a different mechanism” [23]. Similarly, anomalies are patterns that do not conform to an expected behavior [7].

Types of movement anomalies include anomalous records (unusual spatiotemporal or thematic attributes), anomalous (sub)trajectories, and anomalous events (when “the trajectories of individual movers are unremarkable but their combined spatio-temporal pattern is unusual” [19]). Since the definition of anomalies is often context-dependent, ground truth labeled data is rare. Therefore, anomaly detection approaches often resort to trying to identify trajectories that deviate significantly from previously observed trajectories based on some spatial, spatiotemporal, or other metrics. GeoTrackNet [44], for example, is a model for maritime trajectory anomaly detection, which consists of a probabilistic RNN-based (Recurrent Neural Network) representation of AIS tracks and a contrario detection [12]. Anomalies detected by GeoTrackNet were then evaluated by AIS experts. Similarly, Singh et al. [51] present an anomaly detection system based on RNN regression models to detect anomalous trajectories, on-off switching, and unusual turns. Again, a quantitative accuracy analysis is not feasible due to the lack of ground truth data.

To address the issue of lacking ground truth, some researchers resort to using synthetically generated anomalies [33]. For example, Liatsikou et al. [33] developed an LSTM-based network for the automatic detection of movement anomalies, such as the detection of synthetic anomalies in taxi trajectories. The anomalies that can be detected are of limited length: since the autoencoder requires inputs of a certain fixed length, all trajectories are clipped to nine points (and shorter ones are discarded).
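The fixed-length preprocessing and the reconstruction-based decision rule can be sketched as follows. The autoencoder itself is omitted: the sketch assumes that reconstructed trajectories are already available. The function names, the mean point-wise Euclidean error, and the externally supplied threshold are assumptions, not details taken from [33].

```python
import math


def clip_to_length(trajectories, length=9):
    """Clip each trajectory to `length` points and discard shorter ones."""
    return [t[:length] for t in trajectories if len(t) >= length]


def reconstruction_error(original, reconstructed):
    """Mean Euclidean distance between original and reconstructed points."""
    return sum(math.dist(p, q) for p, q in zip(original, reconstructed)) / len(original)


def flag_anomalies(errors, threshold):
    """Mark trajectories whose reconstruction error exceeds the threshold."""
    return [e > threshold for e in errors]
```

In practice, the threshold would have to be chosen from the error distribution on data assumed to be mostly normal, which is itself a design decision that affects the reported precision and recall.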

The difficulty of a specific anomaly detection task depends on the anomalies of interest (and how much they deviate from regularly observed behavior) as well as the availability of labelled training data. In settings where only detected anomalies can be presented to experts for assessment, there is a lack of dependable quantitative data on true and false negatives. The evaluation metrics for anomaly detection tasks usually include accuracy, precision and recall (see Table 3).

2.5 Next location / destination prediction

This use case category covers the prediction of the next locations or final destinations of trips [14, 16, 24, 30, 32, 35, 53]. This usually boils down to predicting the next location or final destination from a finite set of potential locations. Besides GPS tracks, a commonly used data source in this category is social media check-ins (e.g., from Foursquare). The task then becomes to predict the next check-in location.

Attention mechanisms are particularly popular for next location prediction. For example, Gao et al. [16] train VANext, a semi-supervised network trajectory convolutional network, on check-in data. They convert each individual user’s trajectory (check-in/POI sequence) into sequence embeddings using a causal embedding method (similar to a high-order Markov process). The resulting embeddings are the input for their GRU to learn the trajectory patterns. They further apply attention to the embeddings for predicting the user’s next POI. Feng et al. [14] tailor two attention mechanisms to generate independent latent vectors from large and sparse trajectories. These embeddings are then fed into their DeepMove GRUs and a historical attention module. The learned attention weights can intuitively explain the prediction based on the user’s history of movement behavior. Li et al. [30] introduce a spatio-temporal self-attention network (STSAN). They generate trajectory embeddings by concatenating the temporal (activity sequence), spatial (distance matrix of locations), and location attentions (location sequence and their categories). They feed these embeddings through a softmax layer and predict the user’s next POI. They use a federated learning setting to tackle the heterogeneity problem. Liao et al. [32] generate embeddings from location sequences as well as graph embeddings from location-location and activity-location graphs and train their MCARNN multi-task context-aware recurrent neural network to solve both activity and location prediction tasks.

Other works use neural networks for dimensionality reduction and for creating embeddings. Liu et al. [35] incorporate time and distance-specific transition matrices as temporal and spatial embeddings generated by RNNs. Hong et al. [24] reduce the dimensions of trajectories using a multilayered embedding approach for transformers to predict next location and travel mode. Tenzer et al. [53] generate two geospatial and temporal embeddings by 1. combining the random picking and the nearest neighbor to create sequences of spatial embeddings and 2. using a sinusoidal embedding to convert the timesteps to temporal vectors. They train a hyper network to learn to change its weights in response to these embeddings.
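A sinusoidal time embedding of the kind mentioned above can be sketched as follows. The sketch uses the interleaved sine and cosine formulation known from Transformer positional encodings. Whether [53] uses exactly this formulation, dimension, or base is not stated here, so these are assumptions.

```python
import math


def sinusoidal_embedding(timestep, dim, base=10000.0):
    """Encode a scalar timestep as a vector of `dim` interleaved sin/cos values."""
    if dim % 2:
        raise ValueError("dim must be even")
    vec = []
    for i in range(dim // 2):
        # Each pair of dimensions uses a progressively longer wavelength.
        angle = timestep / base ** (2 * i / dim)
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec
```

Nearby timesteps receive similar vectors, while the different wavelengths allow a model to distinguish both short and long time offsets.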

The difficulty of a specific next location / destination prediction task depends on many of the same factors discussed for trajectory prediction. The evaluation metrics for next location / destination prediction tasks usually include accuracy, precision and recall, which reflects the fact that predicting the location from a finite set of potential locations is reminiscent of a classification task. An alternative approach is to treat the task like a regression task and measure, for example, the RMSE of the spatial distance between the predicted and the true location (see Table 3).

2.6 Synthetic data generation

This category covers the generation of synthetic movement data, such as synthetic trajectories [47, 65] and synthetic flows [50]. The creation of synthetic data is of particular interest to either address data privacy concerns or to deal with a lack of real data for certain regions. For example, Rao et al. [47] address the privacy issue by developing an end-to-end deep LSTM-TrajGAN model to generate privacy-preserving synthetic trajectory data for data sharing and publication. Similarly, Zhang et al. [65] propose an end-to-end trajectory generation model for generating synthetic trajectories using VAE-like encoders and decoders where a prior generator based on variational recurrent structure generates noise at time t by considering the noise at the previous time step.

Aggregated flow data, on the other hand, is usually not privacy sensitive but it may not be available in some regions or for certain time periods. Therefore, Simini et al. [50] developed an MLP model (denoted Deep Gravity) to generate mobility flow probabilities. They evaluated Deep Gravity on mobility flows in England, Italy, and New York State and achieved a good performance even for regions with no data available for training.

The difficulty of a synthetic data generation task, as well as the selection of suitable evaluation metrics for synthetic data generation depend on the specific details of the application (for example, whether it aims at generating individual trajectories or aggregated flows) and method used (see Table 3).

2.7 Location classification

This use case category covers the classification of locations, such as certain areas or points of interest (POIs), using patterns derived from movement data. The classification of regionally dominant movement patterns may be of interest in and of itself [62] or it may help with the classification of POIs (e.g. ports [1]) and trip destinations [38].

To detect regionally dominant movement patterns, Yang et al. [62] use direction information and density maps to generate directional flow images. They convert trajectories into images where each pixel contains the directional flows. They use a CNN to classify the input image patterns and detect the dominant regional movements.

An approach that makes more use of the temporal information contained in trajectories is presented by Altan et al. [1]. They use a temporal GNN (TGNN) to distinguish gateway ports from actual ports using AIS vessel movement data. After extracting the ports (nodes) from the raw AIS messages using DBSCAN, they extract trips between consecutive ports and build a graph for each time step to generate the time-ordered daily graph sequence for the TGNN.
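The construction of a time-ordered daily graph sequence can be sketched as follows. The port extraction with DBSCAN is assumed to have happened already, so the input is a list of `(day, origin_port, destination_port)` trips. The function name and the use of trip counts as edge weights are assumptions for illustration, not details confirmed for [1].

```python
from collections import Counter, defaultdict


def daily_graph_sequence(trips):
    """Build one weighted edge dictionary per day, ordered by day.

    trips: iterable of (day, origin_port, destination_port); days must be
    sortable, for example ISO date strings or datetime.date objects.
    """
    graphs = defaultdict(Counter)
    for day, origin, destination in trips:
        graphs[day][(origin, destination)] += 1
    return [(day, dict(graphs[day])) for day in sorted(graphs)]
```

Each element of the returned list corresponds to one time step of the graph sequence that a temporal GNN would consume.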

Lyu et al. [38] train a model to predict trip purposes based on destination locations. Their model is trained using activity, origin, and destination matrices derived from OD data. They use latent mode alignment to align the geographic contextual latent with travel activities. This approach is called the plug-in memory network.

The difficulty of a specific location classification task depends on the nature of the location classes of interest and other factors previously discussed for (sub)trajectory classification tasks. An additional factor is whether the evaluation is performed using locations that fall into the same geographic region as the training locations or whether the evaluation locations are from a different region (thus adding the challenge of model transferability to different geographic contexts). The evaluation metrics for location classification usually include precision and recall (see Table 3).

2.8 Traffic volume prediction

This use case category covers the prediction of traffic or crowd volumes and flows, including, for example, predicting traffic volume on street segments [4, 27], human activity at specific POIs [17] and metropolitan areas [31, 36], or predicting animal movement dynamics [34].

Models for this use case are commonly trained with aggregated trajectory data. For example, traffic movies are the training data provided for the Traffic4Cast 2021 competition which challenged participants to predict traffic under conditions of temporal domain shift (Covid-19 pandemic) and spatial shift (transfer to entirely new cities). Lu [36] won this challenge using a CNN (U-Net) and multi-task learning. Their multi-task learning approach randomly samples from all available cities and trains the U-Net model to jointly predict the future traffic states for different cities. Wang et al. [55] also use the traffic movie approach. They aggregate individual-level trajectories into a grid, with inflow referring to the total amount of traffic entering a region from other regions during a given time interval and outflow representing the total amount of traffic leaving the region. Similarly, Zhang et al. [66] create temporal grids of average traffic speed and taxi inflow per cell.
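The inflow and outflow aggregation described above can be sketched as follows. The sketch assumes each mover's trajectory has already been reduced to a time-ordered list of `(interval, cell)` pairs and attributes each transition to the interval in which the mover is observed in the new cell. Both the input format and this attribution rule are assumptions for illustration.

```python
from collections import Counter


def inflow_outflow(cell_sequences):
    """Count movers entering and leaving each cell per time interval.

    cell_sequences: one time-ordered list of (interval, cell) per mover.
    Returns two Counters keyed by (interval, cell).
    """
    inflow, outflow = Counter(), Counter()
    for seq in cell_sequences:
        for (_, c0), (t1, c1) in zip(seq, seq[1:]):
            if c0 != c1:
                outflow[(t1, c0)] += 1
                inflow[(t1, c1)] += 1
    return inflow, outflow
```

Stacking the per-interval inflow and outflow grids over time gives the kind of image sequence referred to above as a traffic movie.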

Graph-based models represent another common approach for this use case. For example, Li et al. [31] build a graph for their GCN by aggregating CDR data and representing spatial statistical units as nodes and their relationship (physical distance, physical movement, phone calls) as edges. Similarly, Lippert et al. [34] build temporal graphs from bird migration data where nodes represent radar locations, and edges represent the flows between the Voronoi tessellation cells of the radar locations.

Other approaches include, for example, Buroni et al. [4] who provide a tutorial using vehicle counts derived from GPS tracks to build and train a Direct LSTM encoder-decoder model. The model is trained to predict counts of vehicles per network edge per time step for the Belgian motorway network. Similarly, Gao et al. [17] use their GPS tracks to count vehicles per POI per time step (hourly) to train a GCN+GRU model that predicts these visit counts. And Xue et al. [60] propose a translator called mobility prompting which converts daily POI visit counts into natural language sentences so they can use (and fine-tune) pre-trained NLP models such as BERT, RoBERTa, GPT-2, and XLNet to predict these visit counts.

The difficulty of a specific traffic volume prediction task is influenced by the variability of the traffic volumes in the area of interest, as well as the desired prediction horizon and the quality of available training data. The evaluation metrics for traffic volume or crowd flow prediction usually include RMSE, MAE, and MAPE (see Table 3).

3 Why deep learning?

This section attempts to summarize the reasons stated by authors to motivate the use of deep learning (DL) instead of classical machine learning (ML) and discusses the baselines, metrics, and implementations used to measure and compare the performance of DL methods. Finally, this section ends with a critical discussion of the findings.

3.1 DL motivations

Novelty is a crucial factor determining what gets published and what does not. Consequently, we encounter a lot of motivation statements stressing the novelty of applying DL to certain trajectory-related tasks, along the lines of, for example, “applying CNN for classification of trajectory data is relatively unexplored” [62] or “predicting the arrival times with the help of deep learning models has been done in some recent works [...] However, most of these works aim to predict the arrival time for road vehicles [...], and the railroad industry lagged behind” [3].

Beyond the novelty factor, a key motivation for using DL rather than traditional ML is that DL does not rely on hand-crafted features. Feature engineering for mobility prediction and classification is challenging due to: “1) the complex sequential transition regularities exhibited with time-dependent and high-order nature; 2) the multi-level periodicity of human mobility; and 3) the heterogeneity and sparsity of the collected trajectory data” [14]. Hand-crafted features may therefore run into difficulties capturing the full complex picture necessary for accurate predictions or classifications [8, 38].

DL’s ability to take advantage of large volumes of data to learn complex non-linear spatio-temporal relationships is therefore a key motivating factor. This is particularly stressed by, for example, Derrow-Pinion et al. [11], who argue that arrival time prediction “requires accounting for complex spatio-temporal interactions (modelling both the topological properties of the road network and anticipating events - such as rush hours - that may occur in the future).” On a similar note, Liatsikou et al. [33] argue that DL models for anomaly detection “can effectively address the challenges of feature extraction, high dimensionality and non-linearity. Moreover, there is no need to explicitly describe a normal pattern for a trajectory and to define the type of the anomaly. [...] the model is trained to reconstruct most of the trajectories included in a dataset, while it fails on the more irregular ones, which are classified as anomalies.” Buroni et al. [4] state that DL models for road traffic forecasting “allowed to take advantage of the enormous volume of mobility data to capture the complex non-linear space-temporal relationships governing road traffic.” Moreover, DL may help to accurately “perform multi-horizon road traffic predictions on large scale transportation networks” and still “comply to real-time requirements” [37] (which may not be achievable using established ITS or traffic simulation approaches).

Indeed, the black-box nature of DL may be advantageous in some use cases dealing with the issue of location privacy. For example, Rao et al. [47] argue that current approaches for privacy preservation “rely heavily on manually designed procedures” and that “once the procedure is disclosed, one may have the chance to recover the original trajectory data (e.g., using reverse engineering).” Furthermore, for synthetic data generation, “the trade-off between the effectiveness of trajectory privacy protection and the utility for spatial and temporal analyses is still hard to control, and this issue has not been fully discussed or evaluated. Besides, current studies mainly focus on the spatial dimension of trajectory data whereas other semantics (e.g., temporal and thematic attributes) are rarely considered. In fact, these characteristics have been proven to be crucial for trajectory user identification” [47].

The aforementioned advantages notwithstanding, DL may not always be the best approach [21]. Reasons include the black-box nature of many DL models [42] as well as the large amounts of training data and computational resources needed to train DL models [49]. For example, Jonietz et al. [26] point out that the “lack of explainability can result in erosion of trust from users who may question the results”, which in turn may limit the practical impact of DL developments. The explainability of ML and DL models remains an open challenge [42]. While classical ML models often rely on handcrafted features engineered by domain experts, DL models automatically learn features from raw data. In the absence of large, diverse, and representative datasets, DL models may not always capture domain-specific nuances and may be prone to overfitting. Additionally, “the rise of compute-intensive AI research can have significant, negative environmental impacts” [29], depending on how and where the energy for training DL models is generated, stored, and delivered [29, 49]. The choice of DL models over traditional ML models should therefore be justified by significantly better results than those of simpler classic ML baselines.

3.2 Baselines & metrics

To evaluate the performance of new methods, it is essential for authors to report benchmark datasets and models as well as the metrics used in their experiments. Since the machine learning tasks vary according to the considered mobility use case, comparability can at best be achieved within one use case. Therefore, Table 3 summarizes the different traditional ML and DL methods that have been used as baselines as well as the metrics for each use case.

The number of ML and DL baselines provided per publication varies widely, as shown in Table 3. While publications on Traffic volume / crowd flow prediction and Next location / destination prediction tend to provide numerous baselines for comparison, the number of baselines for Location classification, Trajectory prediction, and Anomaly detection is much lower. This may hint at a general lack of suitable reference implementations.

Table 4 Neural network implementations published together with their reviewed papers, ordered by ML library and stars

Within most use cases, authors tend to adopt similar metrics: Classification tasks, such as Location classification, (Sub)trajectory classification and – to a degree – Next location / destination prediction, strongly rely on F1 scores, accuracy, and precision metrics. In addition to providing correct destination predictions, another aspect is to provide the prediction as soon as possible [22]. On the other hand, regression tasks, such as Arrival time prediction, Traffic volume / crowd flow prediction, and Trajectory prediction, tend to rely on RMSE, MAE, and MAPE. Papers on Synthetic data generation, however, do not feature prominent common metrics.
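For reference, the three regression metrics named above can be computed in a few lines. The following is a minimal sketch using the standard textbook definitions; the function name `regression_metrics` and the sample travel times are hypothetical and are not taken from any of the reviewed papers.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and MAPE (in percent) for two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - t for t, p in zip(y_true, y_pred)]
    # Root mean squared error: penalizes large errors more strongly
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    # Mean absolute error: average error magnitude in the original unit
    mae = sum(abs(e) for e in errors) / len(errors)
    # MAPE is undefined where the true value is zero; those samples are
    # skipped here (an assumption of this sketch, other conventions exist)
    ratios = [abs(e / t) for e, t in zip(errors, y_true) if t != 0]
    mape = 100.0 * sum(ratios) / len(ratios) if ratios else float("nan")
    return {"rmse": rmse, "mae": mae, "mape": mape}


# Hypothetical travel times in minutes (illustrative values only)
observed = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 44.0]
metrics = regression_metrics(observed, predicted)
```

RMSE and MAE are expressed in the unit of the target variable, whereas MAPE is relative, which is one reason why papers often report all three side by side.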

As Table 3 shows, some trajectory data-specific DL methods, such as VaNext, DeepMove, and LSTM-TrajGAN, have been used repeatedly as baselines to evaluate new methods. However, not all reviewed publications provide the source code of the published models to facilitate reuse of the implementation. Table 4 lists the publicly available implementations grouped by the used machine learning library (PyTorch, TensorFlow, or Keras) and ordered by their popularity. It is worth noting that most repositories are one-shot projects, that is, the implementations are not actively worked on and maintained after publication of the paper, which likely negatively affects their reusability.

In contrast, there is a growing number of trajectory analysis libraries which offer essential functionality for machine learning from trajectory data, such as Trackintel [40] (e.g., used by [24]), MovingPandas [18] (e.g., used by [41]) and scikit-mobility [46] (e.g., used by [50]) as well as spatiotemporal deep learning frameworks, such as TorchGeo [52] (which focuses on raster data) and GeoTorchAI [9] (which includes a bicycle flow prediction example using ST-ResNet [64]).

3.3 Discussion

The lack of common benchmark methods and datasets makes it difficult to compare methods and to systematically advance the research field. For example, Wang et al. [57] remark that “to date, most research groups assess the performance of their algorithms using their own datasets on their own recognition tasks. These tasks often differ in the sensor modalities or in the allowed recognition latency.” Additionally, there is a lack of benchmark datasets for most use cases (as noted, e.g., regarding location classification by Altan et al. [1]). For some use cases, researchers therefore resort to synthetic data. For example, Liatsikou et al. [33] create their own synthetic trajectory anomalies to evaluate their model.

Another issue is the lack of agreed evaluation approaches and metrics. While papers introducing new DL methods – quite naturally – claim superior performance over existing ML/DL methods, data challenges, such as the SHL Challenge on traffic flow prediction, paint a less clear picture, with regular ML models often outperforming DL models on key metrics, such as F1 score, train time, and test time. For example, Wang et al. [57] compare 7 ML SHL Challenge submissions (including XGBoost, RF, DT and ensembles of classifiers) and 8 DL submissions (including CNN, LSTM, CNN+RNN, Transformer, and AdaNet), coming to the conclusion that “ML outperforms DL in global. ML has a similar higher bound as DL, while achieving a much higher lower bound. The smaller dynamic range verifies the better robustness of ML. While DL achieves the highest F1 score (75.4%), it is only 1.1 percentage points higher than the best ML approach (74.3%).” Furthermore, as would be expected “DL takes much more time for training than ML, and also takes more time for testing.” [57] Regarding DL’s independence from hand-crafted features, it is particularly interesting that “Feature-based DL approaches achieve a much higher F1 score (75.4%) than raw-data-based approaches (43.4%).” [57]

Since the majority of papers are not directly reproducible (due to the lack of data, code, or both) or are prohibitively expensive to reproduce, verification of stated claims remains a challenge. Typical pitfalls faced when evaluating models for mobility use cases include [59]: 1) different datasets that often employ (undocumented) pre-processing pipelines, 2) unnoticed over-fitting due to spatial and/or temporal auto-correlation, and 3) the accuracy/generality trade-off due to idealized conditions and lack of variation in the data. In particular, Widhalm et al. [59] show that evaluation with a random training/test split suggests a considerably higher accuracy than with proper backtesting (reducing the F1 score from 96% to 84% when accounting for auto-correlation and down to 54% when transferring the model to different conditions). Therefore, they argue that results reported in publications may not represent reliable indicators for the future performance in real-world applications, where generality and robustness are essential.
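The difference between a random training/test split and a chronological hold-out can be illustrated with a small sketch. This is not the evaluation protocol of Widhalm et al. [59]; it only assumes that each sample carries a timestamp field (here named `t`, a hypothetical name) and shows one simple way to keep test data strictly later than training data.

```python
import random


def random_split(samples, test_fraction=0.2, seed=0):
    """Shuffle samples and split them, ignoring their time order."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def temporal_split(samples, test_fraction=0.2):
    """Train on the earliest samples and test on the most recent ones."""
    ordered = sorted(samples, key=lambda s: s["t"])
    n_test = int(len(ordered) * test_fraction)
    split = len(ordered) - n_test
    return ordered[:split], ordered[split:]


# Hypothetical samples: one record per timestamp t (illustrative only)
samples = [{"t": t, "x": t * 0.5} for t in range(10)]

rand_train, rand_test = random_split(samples)
time_train, time_test = temporal_split(samples)
```

With the random split, test samples can lie between temporally adjacent (and thus auto-correlated) training samples, which can inflate the reported scores. The chronological split avoids this particular leak, but it does not by itself address spatial auto-correlation or transfer to different conditions.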

4 Trends

While previous sections provided a detailed review of use cases and types of deep neural networks for trajectory data, their focus on recent work is insufficient to comprehensively understand longer-term trends in this field. Therefore, in this section, we investigate these longer-term trends through the following research questions: How has the DL research on mobility trajectory application domains evolved in recent years? Which DL architectures are being utilized in research, and how have they changed over time?

To answer these questions, we performed a systematic literature review (SLR) following the PRISMA method [45] for data collection and selection. PRISMA is a widely used guideline for conducting SLRs. The search was carried out on ‘Google Scholar’ and ‘IEEE Xplore’ using the following keywords:

  • Google Scholar: “Deep Learning” AND “Trajectories” AND “Mobility” -“survey”

  • Google Scholar: “Machine Learning” AND “Trajectories” AND “Mobility”

  • IEEE Xplore: “Deep Learning” AND “Trajectories” AND “Mobility”

The search was limited to publications from 2018 to 2023. The total number of initially retrieved publications was 337. The following steps were performed to select relevant papers: First, we discarded duplicate articles (8 publications) within and across databases. After that, we reviewed titles and abstracts as well as other parts of the publication if necessary to manually classify each publication regarding its use case and machine learning technology. Publications that either do not address one of the previously introduced eight use cases as listed in Table 1 (39 publications) or do not apply neural networks (24 publications) were discarded. In the next step, all publications not published in a journal, at a conference, as a book chapter, or as a preprint (14 publications) were excluded. We retained preprints because many machine learning publications are initially released as preprints before being presented in a journal or at a conference much later. We hope that this will allow us to assess current trends more accurately. In addition, review papers were excluded from further analysis (13 publications). Finally, we included the papers analyzed in the previous sections (36 publications), which yielded a total of 275 relevant papers. The relevant papers comprise 111 articles, 114 journals, 25 preprints, and 0 book chapters.
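The selection funnel can be checked with simple arithmetic. The sketch below only restates the counts given in the text; the variable names and the shortened step labels are our own.

```python
# Counts as reported in the selection procedure
initial = 337
exclusions = [
    ("duplicates", 8),
    ("use case not covered", 39),
    ("no neural network applied", 24),
    ("unsupported publication type", 14),
    ("review papers", 13),
]

remaining = initial
for reason, count in exclusions:
    remaining -= count  # apply each exclusion step in order

# Papers already analyzed in the previous sections are added back in
added_from_previous_sections = 36
total = remaining + added_from_previous_sections
```

After the five exclusion steps, 239 publications remain; adding the 36 previously analyzed papers gives the reported total of 275.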

Fig. 2 Portions of publications by use case from 2018 to 2023. Total number of publications per year is displayed at the top of each bar

Fig. 3 Portions of publications by NN design from 2018 to 2023. Total number of publications per year is displayed at the top of each bar

Figure 2 shows the trend analysis results for the eight previously introduced use cases. By far, the most popular use case is ‘Traffic volume / crowd flow prediction’, responsible for approximately 33% of publications. The next most common use cases are ‘Trajectory prediction / imputation’, followed by ‘Next location / destination prediction’ and ‘(Sub)trajectory classification’. On the other hand, the least common use cases are ‘Anomaly detection’, ‘Synthetic data generation’, ‘Arrival time prediction’, and ‘Location classification’. It is worth noting that the numbers for 2023 only represent publications up until the time of writing (June 2023) and that publications can address more than one use case. Please refer to the Appendix for the absolute number of publications per use case.

To effectively analyze the vast array of diverse neural network architectures extracted from the relevant publications, we categorized and clustered them to be able to discern prevailing trends and patterns. We applied the following categorizations:

  1. RNN – Vanilla RNN, LSTM, or GRU

  2. GNN – Graphs and graph learning methods

  3. CNN – Convolutional networks

  4. AE – Autoencoders (Dense, LSTM, or CNNs)

  5. Attention – Attention mechanisms, including transformers

  6. GAN – Generative adversarial methods

  7. DRL – Deep reinforcement learning

  8. CRNN – Hybrid architectures combining CNN and RNN methods (e.g., LSTM or GRU)

  9. FNN – Fully connected dense networks

  10. Other DL – Any other deep learning architecture, such as Capsule Net, Gravity Net, etc.

This categorization is more fine-grained than the summary in Table 1 to allow for better differentiation between the less common NNs and their respective trends.

Fig. 4 Overview of the deep learning approaches per use cases from 2016 to 2023

Even more than the use case trends, the neural network design trend analysis summarized in Fig. 3 is dominated by a single class: RNNs, accounting for approximately 39% of publications. In the early years, the second most popular option was CNNs, which decreased in popularity in 2020. CNNs have since been overtaken by attention mechanisms and – potentially – GNNs (but the 2023 data here is still inconclusive). Autoencoders (AE) comprised approximately 18% of the publications in 2019 and spiked again in 2021, but seem to have been decreasing in popularity since. FNNs, CRNNs, DRL, GANs, and other DL methods have remained niche applications. Note that publications can integrate multiple neural network designs.
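Because a publication can carry several design labels, yearly portions depend on how multi-label papers are counted. The sketch below shows one possible convention, in which every label assignment counts once and the shares within a year sum to one. This normalization is an assumption made for illustration, as are the function name `yearly_shares` and the sample records; it is not necessarily the procedure behind Fig. 3.

```python
from collections import Counter, defaultdict


def yearly_shares(records):
    """Return {year: {label: share}} with shares per year summing to 1."""
    counts = defaultdict(Counter)
    for year, labels in records:
        # Count each label at most once per publication
        counts[year].update(set(labels))
    shares = {}
    for year, counter in counts.items():
        total = sum(counter.values())  # total label assignments in that year
        shares[year] = {label: n / total for label, n in counter.items()}
    return shares


# Hypothetical (year, labels) records, illustrative only
records = [
    (2021, ["RNN", "Attention"]),
    (2021, ["RNN"]),
    (2021, ["GNN"]),
    (2022, ["CNN", "RNN"]),
]
shares = yearly_shares(records)
```

An alternative convention divides by the number of publications per year instead; the resulting shares can then add up to more than 100%, so the chosen denominator should be stated alongside any such figure.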

Bringing both neural network design and use case perspective together, we also analyzed the techniques employed for each use case over the years, as shown in Fig. 4. This shows the prevalent use of RNNs for tasks such as traffic volume prediction, trajectory prediction, and next location/final destination prediction. Furthermore, we observed an emerging trend where GNNs are increasingly being adopted as a viable solution for traffic volume and trajectory prediction use cases. Similarly, CNNs have gained popularity as a practical approach for traffic volume prediction. However, in trajectory classification tasks, they serve as the second choice after RNNs.

5 Conclusion

In this work, we reviewed deep learning-based research focusing on trajectory data in mobility use cases. In most cases, even if dense raw trajectory data is used in the process, it is not ingested directly for training the neural networks. Instead, data engineering steps are applied that convert trajectories into more compact representations of individual trajectories (sparse trajectories) or aggregations of multiple trajectories. This aggregated trajectory data is commonly presented as time series of vectors, graphs, or images (movies). Since these data engineering steps are a recurring need in mobility data science, we expect further uptake of trajectory analysis libraries, such as Trackintel, MovingPandas, and scikit-mobility. These libraries implement many common trajectory generalization, aggregation, and analysis methods, and their focus on scientific software engineering aims at reusability and long(er)-term availability. This is an essential step towards more sustainable mobility data science: most of the implementations summarized in Table 4 have not been substantially updated or maintained since being published, which reduces their reusability because they will – sooner or later – become incompatible with newer versions of the underlying data science libraries.

Future research should address the issues of model transferability, benchmark availability, and model explainability. Current work rarely addresses the issue of model transferability. Since most existing global ML models “cannot perform well locally, or be transferred to study similar problems in other regions” [25], transferability should be considered when evaluating or comparing models. Additionally, developed models, even for the same application and trajectory type, are difficult to evaluate (for example, due to the lack of ground truth for anomaly detection) and to compare due to different datasets and applied metrics. Therefore, more open benchmark datasets are needed. Finally, to better understand the inner workings of neural networks and to ensure the trustworthiness of model decisions, explainability should play a more crucial role in model development.