A Unified Framework for Online Trip Destination Prediction

Trip destination prediction is an area of increasing importance in many applications such as trip planning, autonomous driving and electric vehicles. Even though this problem could be naturally addressed in an online learning paradigm where data is arriving in a sequential fashion, the majority of research has rather considered the offline setting. In this paper, we present a unified framework for trip destination prediction in an online setting, which is suitable for both online training and online prediction. For this purpose, we develop two clustering algorithms and integrate them within two online prediction models for this problem. We investigate the different configurations of clustering algorithms and prediction models on a real-world dataset. We demonstrate that both the clustering and the entire framework yield consistent results compared to the offline setting. Finally, we propose a novel regret metric for evaluating the entire online framework in comparison to its offline counterpart. This metric makes it possible to relate the source of erroneous predictions to either the clustering or the prediction model. Using this metric, we show that the proposed methods converge to a probability distribution resembling the true underlying distribution with a lower regret than all of the baselines.


Introduction
In today's society almost all newly produced cars are equipped with some sort of built-in navigation system using GPS-based data. Consequently, new research areas have emerged in data analytics for transport systems due to the large amount of geospatial data being shared over cellular networks. One such area is future route and destination prediction, which has received considerable attention in the last decade. Being able to accurately predict the future location and/or route of a vehicle has some obvious advantages for individual vehicles, e.g., avoiding traffic congestion, estimating travel time and electric range, and improving personalization and adaptation. For the same reason, it can also be beneficial for Traffic Management Centers (TMC) to better estimate future traffic situations.
Another potential application area for future route and destination prediction is in designing energy management systems (EMS) for hybrid vehicles, i.e., to control the power split between the internal combustion engine and the electrical machine. Optimal control of hybrid vehicles is a non-trivial task when a trip (or sequence of trips) exceeds the all-electric range. Several studies have compared the simple electric vehicle/charge sustaining (EV/CS) strategy with blended strategies, i.e., strategies that continuously blend the energy from the battery and the fuel in such a way that the battery is depleted at the very end of a trip. In comparison to EV/CS, such strategies have been shown to yield a significant reduction of the fuel consumption, e.g., up to 20% as reported in [23]. The main drawback of blended strategies is that they require a priori information about the trip in order to find the best possible strategy, e.g., destination, route, travel time, etc. However, such information is not necessarily available. The authors in [25] noted that if a trip or route is recognized from a driving history, then one can employ a blended strategy for that specific trip. If that is not possible for the specific trip, one can always revert to the simple EV/CS strategy. The model they presented was able to reduce the fuel consumption by 1.5% in comparison to the EV/CS strategy, without any a priori information. In [34], the authors use driving histories in order to optimize the EMS and learn a look-up table for the controller parameters based on frequently traversed routes. The result is an EMS that consumes only 2.5% more energy than the corresponding posterior global optimal result.

arXiv:2101.04520v2 [cs.LG] 29 Dec 2022

Accordingly, in this paper, we study the prediction of trip destinations in different settings. We denote the full trip history of an individual user $u$ by $X_u$. An individual trip will be referred to as $x_u(i)$, for $i = 1, \dots, N_u$, where $N_u$ is the length of the trip history, i.e. the number of trips available for that user. Each trip consists of GPS-coordinates, i.e. the latitude and longitude pairs of the source and the destination. Let $X_u(i) = \{x_u^s(i), x_u^d(i)\}$, where $x_u^s(i)$ and $x_u^d(i)$ represent the source and the destination of trip $x_u(i)$. One can also form $X_u^s$ and $X_u^d$, which are the concatenations of all sources and all destinations in the entire trip history of user $u$. In its simplest form, the task can then be specified as predicting $x_u^d(i)$ from $x_u^s(i)$ for future trips $x_u(i)$.

Trip destination prediction can be considered in an online or an offline setting, which are two very different ways of approaching the task. In the offline setting, one would train a model on the full trip history $X_u^s$ and $X_u^d$ in order to predict all future trips. In the online setting, however, one does not assume access to the full trip history, and instead considers the trips arriving in a sequential or incremental fashion. In other words, the prediction of $x_u^d(i)$ from $x_u^s(i)$ is made after having observed all trips $X_u(i')$ for $i' < i$, and once the actual trip $X_u(i)$ is observed the model is immediately updated. The problem in this setting then becomes to perform as well as possible in comparison to the offline model trained on the full trip history.
As will be discussed in Section 2, several previous works study the trip destination problem from different aspects. However, those studies separate the prediction model from the formation of the prediction space, i.e., the clustering of candidate locations. In particular, the clustering is not considered in the evaluation of the final model. In addition, almost all of the previous works assume an offline setting where they investigate the model after it has been trained on a full dataset.
In this paper, we develop and investigate a unified framework fully suitable for online training and prediction. For this purpose, two novel online clustering algorithms are used with two different online prediction models, which enable the entire framework to be investigated in an online setting. The two clustering algorithms are adaptations of an incremental variant of the DBSCAN clustering algorithm [14]. Instead of storing all previously seen points, the two proposed algorithms store centroids belonging to the different clusters as well as the outlier points. The first prediction model is a probabilistic Bayesian model, using sequential updates for its parameters. The second prediction model is an adaptation of an expert model, where the set of available experts is dynamic. Both models yield a distribution over the possible destinations, which can be compared to the true distribution obtained by the offline model.
The different configurations of clustering algorithm and prediction model are investigated on a real-world dataset. First, the clustering algorithms are evaluated using supervised clustering metrics, where the offline DBSCAN is considered the true clustering. Second, the entire framework is evaluated using accuracy, which is traditionally used for trip destination prediction. It is shown that these configurations yield results consistent with the offline model on unseen data. Furthermore, the online framework is evaluated using a novel metric based on the Hellinger distance, such that it is possible to relate the source of erroneous predictions to either the clustering or the prediction model. From this, one can observe that the learning improves as more trips are added.
The rest of the paper is organized as follows. In Section 2, we review the related works and position our contribution w.r.t. them. In Section 3, we introduce and formalize an offline methodology to serve as the baseline for the proposed online models. It consists of clustering candidate locations and estimating transition probabilities between the discovered locations. Section 4 extends and adapts both the clustering and the prediction model in the offline methodology to an online setting, i.e., the case where data arrives sequentially over time. In addition, we introduce an evaluation metric to measure the performance of the entire pipeline. In Section 5, we conduct the experiments and evaluate our proposed models on a real-world dataset. Finally, in Section 6, we conclude the paper and discuss future directions.

Related work
To the best of our knowledge, [4] is one of the first works on the topic of predicting user destination from historical GPS-data. This model can be split into two parts: (i) clustering the raw GPS-points into candidate destinations, and (ii) using a Markov model with the candidate locations as states to predict the next destination. This methodology, wherein the candidate locations are first found from historical GPS-logs and later used in a graphical model (some variant of a Markov model), has since been adopted by a number of subsequent works [3,30,27,35].
In [30], a Hidden Markov Model (HMM) is built using the links obtained by mapping GPS positions to a map database. For an on-going trip, this model can be used to find both the next road link and the next sequence of road links that are most likely. In other words, it can be used to predict the most likely future route/destination of an on-going trip. In order to make the model independent of a map database, the authors in [3] suggested considering support points along traversed routes, e.g., intersections that could differentiate between possible destinations. Using the support points as observable states and the candidate locations as hidden states, an HMM is built and used to predict the destination of trips. Other approaches that have been investigated include similarity matching between the current trip and previously stored trips. In [24], the authors suggest predicting the route by applying string matching techniques to a database of stored routes. The authors in [17] present a similarity algorithm to cluster similar trips using hierarchical clustering and then predict the most likely route/destination.
A more recent work is [27], wherein a probabilistic Bayesian model conditioned on the origin and current road link is used to predict the future destination. In [12] the authors develop a Bayesian framework to model route patterns, and present a model based on Markov chains to probabilistically predict the route/destination of an ongoing trip. Perhaps the closest related work is [16], wherein a k-nearest neighbors (kNN) model is used to find the most important destinations and a Markov chain model is employed for predictions. Both the available destinations and the Markov chain transitions are updated in an online setting. However, they do not consider the clustering method as part of their online framework, and their model is only evaluated using accuracy on repeat trips, i.e., trips that are never repeated are removed in a preprocessing step. In summary, they do not use an explicit destination clustering method, their online model is simply based on counting, and their evaluation uses only accuracy.
There are also a significant number of works using slightly different types of data. In [11], the authors look at two online learning algorithms applied to a Bayes net model to predict destinations of users. However, the data that they consider is based on voluntary check-ins from social networking websites. There has also been extensive work on public transport and taxi data, e.g., in 2015 the method [5] won the ECML/PKDD 15: Taxi Trajectory Prediction challenge on Kaggle. The proposed model consists of a clustering model to find candidate locations, followed by a recurrent neural network (since it was trained on initial trajectories of trips). The destinations were predicted using a weighted average of the softmax output of the different cluster centroids. Taxi services have also been phrased as a reinforcement learning problem, see for example [18], where the authors use the total profit of a taxi driver as the reward function, the location and status of the taxi as the state space, and the choices of operating the taxi as the action space. Finally, the authors in [10] propose an offline trip destination model for public transport based on trip histories in an evaluation set and the neighboring user trips.
We also note that there are problem formulations that are similar, or closely related, to trip destination prediction. One such example is trip purpose prediction, see, e.g. [13,9,33], where one tries to predict the purpose of a trip, for instance shopping, restaurant, education, etc. Another example is when trip destination prediction is used as a sub-problem, e.g., the authors in [31] use trip destination prediction of taxi data in order to forecast gathering events. We also note that destination prediction in general can be employed by trip planning methods, e.g., the methods described in [21,1] for navigation and in [2] for bottleneck identification, which propose effective plans for given sources and destinations.
All of these works assume that the prediction model is separated from clustering candidate locations which represent the prediction space. Thus, they consider the clustering as a preprocessing step, and then narrow down the trips that are being considered to only the frequent trips. Even in the few cases where the clustering has been mentioned and explored explicitly, it has not been reflected in the evaluation of the final model. Furthermore, almost all of the previous works assume an offline setting for their model, fully investigating their model after it has been trained on a full dataset. In this paper, we develop a unified online learning framework that i) takes into account the creation and evolution of clusters in a consistent way, and ii) learns the model and the parameters in an online fashion. For this purpose, we propose two novel online clustering algorithms to be used with two different online prediction models, and investigate the entire framework in a fully online setting.

Offline trip prediction
In this section, we introduce an offline model for the prediction of trip destinations. The offline model will serve as the baseline, to which the proposed online model will be compared. Our model is adopted from the methodology proposed by [4], which suggests first using clustering to find candidate locations and then estimating the transition probabilities between each of the found locations.

Clustering
For clustering, we use DBSCAN [14], which is a simple yet effective density-based method. It does not require the number of clusters to be fixed in advance and is also robust to outliers. It has also been shown to work well on low dimensional data. Since the data being considered in this study is 2-dimensional with latitude and longitude features, DBSCAN is considered to be appropriate. Furthermore, this method also has the advantage of making the clusters easy to interpret in terms of the choice of parameter values. The algorithm takes two parameters: $\epsilon$, which is the maximum distance between two points for them to be considered inside the same neighborhood, and $m$, which is the minimum number of points required inside a neighborhood to form a cluster. Since the trip history $X_u$ consists of GPS-coordinates, a suitable distance metric is the Haversine distance. The Haversine distance [28] between two points, $x$ and $y$, consisting of latitude and longitude features is defined as
$$d(x, y) = 2R \arcsin\left(\sqrt{\sin^2\left(\frac{\Delta\varphi}{2}\right) + \cos(\varphi_x)\cos(\varphi_y)\sin^2\left(\frac{\Delta\psi}{2}\right)}\right).$$
Here we have used the short-hand notation $\Delta\varphi = \varphi_y - \varphi_x$ and $\Delta\psi = \psi_y - \psi_x$, where $\varphi$ and $\psi$ denote the latitude and longitude (in radians) of the respective point and $R$ is the radius of the Earth.

Ideally, it should not matter whether one chooses to cluster the sources $X_u^s$ or the destinations $X_u^d$, since the end of one trip corresponds to the beginning of another. However, one recurring problem when working with GPS-data is the initial time it takes for the GPS receiver to acquire the satellite signal. This delay is usually between 10 and 60 seconds, but could be as long as a couple of minutes, which makes the source of every trip more uncertain than the destination. Thus, it is reasonable to use $X_u^d$ to form the clusters and find the candidate locations. Clustering all destinations in $X_u^d$ using DBSCAN returns the cluster labels $C_u^d$ for the destinations, where $c_u^d(i)$ for $i = 1, \dots, N_u$ is used to denote the label of an individual trip destination. Further, let the set of all cluster labels be denoted by $M_u = \{0, 1, \dots, K-1\}$, where $K$ is the number of clusters and $c_u^d(i) = -1$ corresponds to an outlier.

One still needs to find the corresponding cluster labels of the source of each trip, i.e. $C_u^s$, even though they were not used in the clustering procedure. For the reasons already mentioned (robustness of the source), one cannot simply assign the source to a cluster if the distance to a dense neighborhood is less than $\epsilon$. Instead, whenever there are two or more clusters, we compute $C_u^s$ according to Alg. 1, which essentially assigns a cluster label to $c_u^s(i)$ if the closest cluster is $\delta$ times closer than the second closest cluster, and otherwise considers it an outlier.
Algorithm 1 Find the corresponding cluster label c s u (i) of the source of a trip x s u (i).
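The rule described for Alg. 1 can be illustrated with a short Python sketch. This is only one possible reading: the function names `haversine` and `assign_source_label`, the `cluster_points` dictionary layout, the use of the nearest cluster member as the distance to a cluster, and the Earth radius constant are all assumptions made for illustration and are not taken from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine(x, y):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, x)
    lat2, lon2 = map(math.radians, y)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def assign_source_label(source, cluster_points, delta):
    """Return the label of the closest cluster, or -1 for an outlier.

    cluster_points maps a cluster label to the destination points in that
    cluster (hypothetical layout). The distance to a cluster is taken as the
    distance to its nearest member, which is an assumption of this sketch.
    """
    if len(cluster_points) < 2:
        raise ValueError("the rule is only described for two or more clusters")
    dists = sorted(
        (min(haversine(source, p) for p in points), label)
        for label, points in cluster_points.items()
    )
    (d1, closest), (d2, _) = dists[0], dists[1]
    # Accept only if the closest cluster is delta times closer than the runner-up.
    return closest if d1 * delta <= d2 else -1
```

A source that lies roughly midway between two clusters is therefore labelled as an outlier, whereas a source next to one cluster inherits that cluster's label.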

Bayesian prediction model
After the clustering procedure, the entire trip history can be represented with the cluster labels $C_u^s$ and $C_u^d$ for the sources and destinations respectively. All the transitions made by the user $u$ are then of the form $c_u^s(i) \to c_u^d(i)$ for $i = 1, \dots, N_u$. We write $\Lambda_k = p(c_u^d = k)$ for the probability of a trip ending in $k$, and $\lambda_{jk} = p(c_u^d = k \mid c_u^s = j)$, which represents the probability of the destination being $k$ given that the trip starts at $j$.
We assume that the distribution $p(c_u^d)$ and all of the conditional distributions $p(c_u^d \mid c_u^s)$ follow categorical distributions.
In the following, we describe how to estimate the parameters $\Lambda$ and $\lambda_j$ for $j = -1, 0, \dots, K-1$.
Parameter estimation. We adopt a Bayesian approach to estimate the parameters. In this approach, we use a prior distribution over $\Lambda$ and $\lambda_j$ for all $j = -1, 0, \dots, K-1$, after which Bayes' rule can be used to update the posterior distribution. Starting with $\Lambda$, assume that $p(\Lambda \mid \beta)$ follows a Dirichlet distribution with some concentration parameters $\beta = (\beta_{-1}, \beta_0, \dots, \beta_{K-1})$. Using Bayes' rule, the posterior $p(\Lambda \mid C_u^d, \beta)$ can then be computed as:
$$p(\Lambda \mid C_u^d, \beta) = \frac{p(C_u^d \mid \Lambda)\, p(\Lambda \mid \beta)}{p(C_u^d \mid \beta)},$$
where the $\beta$'s are the hyperparameters. The two terms in the numerator, i.e. $p(C_u^d \mid \Lambda)$ and $p(\Lambda \mid \beta)$, are often referred to as the likelihood and the prior distribution respectively. On the other hand, the denominator $p(C_u^d \mid \beta)$ is the marginal likelihood, or evidence, which refers to the distribution once the parameter $\Lambda$ has been marginalized out.
Using the fact that the Dirichlet distribution is the conjugate prior to the categorical distribution, it holds that $p(\Lambda \mid C_u^d, \beta) \sim \mathrm{Dir}(K+1, n + \beta)$, where $n = (n_{-1}, n_0, \dots, n_{K-1})$ and $n_j$ is the number of trips ending in $j$. Essentially, the hyperparameters $\beta$ can be treated as pseudocounts in the model. In other words, the event probabilities $\Lambda$ are set to
$$\Lambda_j = \frac{n_j + \beta_j}{\sum_{k=-1}^{K-1} (n_k + \beta_k)}.$$
Estimating $\lambda_j$ for $j = -1, 0, \dots, K-1$ follows a similar procedure as when estimating $\Lambda$. Given a single value of $j$, assume that $p(\lambda_j \mid \alpha_j)$ follows a Dirichlet distribution with hyperparameters $\alpha_j = (\alpha_{j,-1}, \alpha_{j,0}, \dots, \alpha_{j,K-1})$. According to Bayes' rule, the posterior $p(\lambda_j \mid C_u^{\{j\}}, \alpha_j)$ is determined by
$$p(\lambda_j \mid C_u^{\{j\}}, \alpha_j) = \frac{p(C_u^{\{j\}} \mid \lambda_j)\, p(\lambda_j \mid \alpha_j)}{p(C_u^{\{j\}} \mid \alpha_j)},$$
where $C_u^{\{j\}}$ is used to denote all trips starting in $j$. Using conjugacy, it holds that $p(\lambda_j \mid C_u^{\{j\}}, \alpha_j) \sim \mathrm{Dir}(K+1, \bar{n}_j + \alpha_j)$, where $\bar{n}_j = (n_{j,-1}, n_{j,0}, \dots, n_{j,K-1})$ and $n_{jk}$ is the number of trips going from $j$ to $k$. Therefore, the event probabilities are
$$\lambda_{jk} = \frac{n_{jk} + \alpha_{jk}}{\sum_{k'=-1}^{K-1} (n_{jk'} + \alpha_{jk'})},$$
and once again $\alpha_{jk}$ can be interpreted as pseudocounts.
We note that setting the hyperparameters to zero in a distribution will yield the maximum likelihood estimate of the corresponding event probabilities. However, this would imply that all the transitions not present in the dataset will have a zero probability of occurring in the model, e.g. $\alpha_{jk} = 0$ and $n_{jk} = 0$ would result in $\lambda_{jk} = p(c_u^d = k \mid c_u^s = j) = 0$.
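As an illustration of the pseudocount interpretation, the following Python sketch computes the event probabilities from label sequences. The names `posterior_mean` and `fit_offline` are hypothetical, and using one shared pseudocount for every label (a symmetric Dirichlet prior) is a simplifying assumption of the sketch rather than a choice stated in the text.

```python
from collections import Counter


def posterior_mean(counts, labels, pseudocount):
    """Event probabilities (n_k + pseudocount) / sum over all labels.

    Assumes the normalising sum is positive, i.e. pseudocount > 0 or at
    least one observed trip.
    """
    total = sum(counts.get(k, 0) + pseudocount for k in labels)
    return {k: (counts.get(k, 0) + pseudocount) / total for k in labels}


def fit_offline(sources, destinations, labels, alpha=1.0, beta=1.0):
    """Estimate the marginal and the per-source conditional distributions.

    sources and destinations are equally long sequences of cluster labels,
    labels is the full label set including -1 for outliers.
    """
    marginal = posterior_mean(Counter(destinations), labels, beta)
    conditional = {}
    for j in labels:
        # Counts n_jk of trips that start in j and end in k.
        n_j = Counter(d for s, d in zip(sources, destinations) if s == j)
        conditional[j] = posterior_mean(n_j, labels, alpha)
    return marginal, conditional
```

With a positive pseudocount, a source that has never been observed still receives a uniform distribution over destinations instead of undefined or zero probabilities.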

Online trip prediction
In this section, we extend the offline prediction model to an online setting. The trip history of a user, $X_u$, will now arrive sequentially, one trip at a time, which the offline model cannot handle. The offline clustering requires the entire trip history $X_u$ to find the candidate locations, and the prediction model requires the cluster labels $C_u$ to estimate the transition probabilities. Thus, to make the entire pipeline online, both the clustering and the prediction model have to be adapted appropriately.
The proposed online framework is shown in Fig. 1. For each new trip, we use the clustering model to find the cluster of the source (starting point) of the trip. The output from the clustering is then used to predict the destination using the prediction model. Next, the actual destination is observed after the trip takes place and is used to update both the clustering and the prediction model (and their parameters). Finally, the predicted destination is evaluated in comparison to the actual destination. Therefore, the framework consists of three main components: i) prediction, ii) learning, and iii) evaluation. These steps are then repeated for every new trip that is observed.

Figure 1: An overview of the proposed online framework, which includes a clustering and a prediction model, both updated incrementally. It is organized in three components: i) prediction, ii) learning, and iii) evaluation.
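The predict, learn, evaluate cycle can be sketched as a small Python loop. The two classes below are deliberately crude stand-ins, not the clustering algorithms or prediction models of this paper: `RoundingClusterer` simply labels rounded coordinates and `CountingPredictor` returns the most frequent destination per source. All names are hypothetical and only serve to make the loop runnable.

```python
from collections import Counter, defaultdict


class RoundingClusterer:
    """Toy stand-in for online clustering: one label per rounded coordinate."""

    def __init__(self):
        self.labels = {}

    @staticmethod
    def _key(point):
        return (round(point[0], 2), round(point[1], 2))

    def label_of(self, point):
        return self.labels.get(self._key(point), -1)

    def update(self, point):
        return self.labels.setdefault(self._key(point), len(self.labels))


class CountingPredictor:
    """Toy stand-in for the prediction model: most frequent destination."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def predict(self, source_label):
        seen = self.counts[source_label]
        return seen.most_common(1)[0][0] if seen else -1

    def update(self, source_label, destination_label):
        self.counts[source_label][destination_label] += 1


def run_online(trips, clusterer, predictor):
    """Run the predict-then-update loop and return one hit flag per trip."""
    hits = []
    for source, destination in trips:
        s = clusterer.label_of(source)        # i) prediction
        guess = predictor.predict(s)
        d = clusterer.update(destination)     # ii) learning
        predictor.update(s, d)
        hits.append(guess == d)               # iii) evaluation
    return hits
```

The important design point is the ordering: the prediction for a trip is made before its destination is used for any update, so every trip acts as unseen test data exactly once.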

Online clustering
Online variants of different clustering algorithms have already been proposed in the literature, e.g. an incremental variant of DBSCAN is proposed in [15]. Inspired by this incremental adaptation of DBSCAN, we propose two different variants of DBSCAN to cluster the points online. The main difference is that instead of storing the core points, these variants store core centroids and keep track of the number of points within a specified radius. Both variants take the same parameters as the original DBSCAN algorithm, i.e. $m$ and $\epsilon$, representing the minimum number of points in a cluster and the distance threshold respectively.
For every new point $x_u^d(i)$ that arrives, the point is clustered and $c_u^d(i)$ is obtained as a result of the clustering. In order to determine $c_u^s(i)$, we employ the same approach as in the offline setting. That is, in order to assign a label, the closest cluster has to be at least $\delta$ times closer than the second closest one, and otherwise the source is considered an outlier.

Online DBSCAN 1
The first variant presented in Alg. 2 takes an additional parameter $r$, which is used to determine the radii of the centroids that are stored. When $r \to 0$, the centroids will naturally become points, in which case the method will behave as the incremental adaptation of DBSCAN in [15]. The centroids are stored as $(c_q, n_q, l_q)$, where the elements are the centroid itself, the number of points it contains, and the cluster label of the centroid respectively. All non-clustered points are stored as $(x_s, n_s, t_s)$, where the elements are the point itself, the number of neighbors, and the timestamp of the point respectively. Next, the function CHECKFORNEWCENTROIDS($P_x \cup \{p_x\}$) looks at all non-clustered points in $P_x$ as well as the possible new point $p_x$ and finds the points that should be upgraded to a centroid (line 14). This is the case for all points where $n_s \geq m - 1$. Depending on whether the neighbors are part of an existing centroid or not, three different scenarios can occur:
1. Only neighbors in $P$: Add the centroid to a new cluster.
2. Neighbors in a single cluster: Add the centroid to the existing cluster.
3. Neighbors in multiple clusters: Merge the existing clusters and add the centroid to the resultant cluster.
Finally, all the points that are too old according to the function DELETEOLDPOINTS(t) are removed from P (line 15). This function is elaborated further in the experiments section.
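To make the bookkeeping concrete, the following Python sketch gives one simplified reading of the first variant. It is not the exact Alg. 2: the class name `OnlineDBSCANSketch`, the planar distance `math.dist` used in place of the Haversine distance, the `max_age` rule standing in for DELETEOLDPOINTS, the way the neighbor count of a new point includes points held by nearby centroids, and the choice to start each new centroid with a count of one are all assumptions made for illustration.

```python
import math


class OnlineDBSCANSketch:
    """Simplified reading of the first online variant (not the exact Alg. 2)."""

    def __init__(self, eps, m, r, max_age=math.inf, distance=math.dist):
        self.eps, self.m, self.r = eps, m, r
        self.max_age = max_age      # hypothetical stand-in for DELETEOLDPOINTS
        self.distance = distance    # planar distance by default (assumption)
        self.centroids = []         # entries [c_q, n_q, l_q]
        self.points = []            # entries [x_s, n_s, t_s]
        self.next_label = 0

    def _near_centroids(self, x):
        return [q for q in self.centroids if self.distance(x, q[0]) < self.eps]

    def _promote(self, p):
        """Turn non-clustered point p into a centroid and return the new entry."""
        self.points = [s for s in self.points if s is not p]
        labels = {q[2] for q in self._near_centroids(p[0])}
        if labels:                           # scenario 2, or 3 if several labels
            label = min(labels)
            for q in self.centroids:
                if q[2] in labels:
                    q[2] = label             # merge clusters under one label
        else:                                # scenario 1: start a new cluster
            label, self.next_label = self.next_label, self.next_label + 1
        entry = [p[0], 1, label]
        self.centroids.append(entry)
        return entry

    def add(self, x, t):
        """Cluster point x seen at time t; return its label (-1 = outlier)."""
        near = [p for p in self.points if self.distance(x, p[0]) < self.eps]
        for p in near:
            p[1] += 1                        # x is a new neighbour of p
        owner, new = None, None
        closest = min(self.centroids, default=None,
                      key=lambda q: self.distance(x, q[0]))
        if closest is not None and self.distance(x, closest[0]) <= self.r * self.eps:
            closest[1] += 1                  # x is absorbed by the centroid
            owner = closest
        else:
            n_x = len(near) + sum(q[1] for q in self._near_centroids(x))
            new = [x, n_x, t]
            self.points.append(new)
        for p in near + ([new] if new else []):
            if p[1] >= self.m - 1:           # dense enough to become a centroid
                entry = self._promote(p)
                if p is new:
                    owner = entry
        # Drop non-clustered points that are older than max_age.
        self.points = [p for p in self.points if t - p[2] <= self.max_age]
        return owner[2] if owner else -1
```

Because only centroids and non-clustered points are stored, the memory footprint depends on the number of distinct dense regions rather than on the number of trips, which is the motivation given above for replacing core points with core centroids.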

Algorithm 2 Online clustering -V1
Input: New point $x$, and timestamp $t$
Parameters: distance threshold $\epsilon$, minimum number of points in a cluster $m$, radii fraction $r$ (i.e. $r \cdot \epsilon$ is the radius of the centroids)
7: $n_{q^*} \leftarrow n_{q^*} + 1$
8: else
9:

The second variant, shown in Alg. 3, does not have any additional parameters and the stored elements are slightly different. All non-clustered points are still stored as $(x_s, n_s, t_s)$. However, the centroids are now stored as $(c_k, n_k, \epsilon_k)$, where the first two elements are still the centroid and the number of points it contains, but the third element is now the centroid radius. In this variant, there will be only a single centroid for each cluster, but the centroid will continuously grow when new points are encountered, i.e. the individual centroid radius will increase. In this way, this variant is rather similar to K-means clustering.

Algorithm 3 Online clustering -V2
Input: New point $x$, and timestamp $t$
Parameters: distance threshold $\epsilon$, minimum number of points in a cluster $m$
Stored: non-clustered points $P = \{(x_s, n_s, t_s)\}_{s=1}^{S}$, centroids $C = \{(c_k, n_k, \epsilon_k)\}_{k=1}^{K}$
1: $P_x = \{(x_s, n_s, t_s) \in P \mid \mathrm{DISTANCE}(x, x_s) < \epsilon\}$
2: for $(x_s, n_s, t_s) \in P_x$ do
3: $n_s \leftarrow n_s + 1$
4: end for
5: $k^* = \arg\min$

Many of the steps of this algorithm are identical to Alg. 2. The main difference is the individual radius of each centroid, $\epsilon_k$, which makes a difference when computing the closest cluster (line 5). The only additional change is that the function CHECKFORNEWCENTROIDS has been replaced by UPDATECLUSTERS (line 14). This function plays a similar role and looks at all non-clustered points in $P_x$ and finds the points that can be used to update the centroids, i.e. those with $n_s \geq m - 1$. Once again, depending on whether the neighbors are part of an existing centroid or not, three different scenarios can occur:
1. Only neighbors in $P$: Create a new cluster.
2. Neighbors in a single centroid: Update the existing centroid, i.e. the radius $\epsilon_k$, to contain the point.
3. Neighbors in multiple centroids: Merge the existing centroids, i.e. update the centroid $c_k$ and radius $\epsilon_k$ to cover all of the previous centroids as well as the new point.

Bayesian model
We adapt the offline prediction model to the online setting. The difference is that the set of all cluster labels can now change with every new trip that is observed, i.e. $M_u$ has to be replaced with $M_u(i)$.
Parameter estimation. In this setting, we can still update the posterior distribution of $\Lambda(i)$ and $\lambda_j(i)$ using Bayes' rule. However, here we employ sequential Bayesian updating, which works by letting the prior distribution in each timestep be the posterior distribution of the previous timestep. In detail, looking at $\Lambda(i)$, the update rule can be written as
$$p(\Lambda(i+1) \mid C_u^d[i+1], \beta) \propto p(c_u^d(i+1) \mid \Lambda(i))\, p(\Lambda(i) \mid C_u^d[i], \beta),$$
where $[i]$ is used to denote trip $i$ and all previous trips, whereas $(i)$ is used to denote the specific trip $i$.
There is one problem with this formulation, however, since $\Lambda(i+1)$ and $\Lambda(i)$ do not necessarily have the same number of classes/clusters. This means that the prior $p(\Lambda(i) \mid C_u^d[i], \beta)$ needs to be changed to $p(\hat{\Lambda}(i) \mid C_u^d[i], \beta)$, where $\hat{\Lambda}(i)$ should be computed from $\Lambda(i)$. This is feasible, since the clustering algorithms in Alg. 2 and Alg. 3 return the updates from the clustering when creating new centroids. This leads to the following scenarios for $\hat{\Lambda}(i)$:
• New cluster label, $\hat{k}$
• Merged cluster labels, $\hat{k} \in \hat{K}$
Thus, the final update is written as
$$p(\Lambda(i+1) \mid C_u^d[i+1], \beta) \propto p(c_u^d(i+1) \mid \hat{\Lambda}(i))\, p(\hat{\Lambda}(i) \mid C_u^d[i], \beta).$$
Since the Dirichlet distribution is the conjugate prior to the categorical distribution, it holds that $p(\Lambda(i+1) \mid C_u^d[i+1], \beta) \sim \mathrm{Dir}(K(i+1) + 1, n(i+1) + \beta)$, where $n(i+1) = (n_{-1}(i+1), n_0(i+1), \dots, n_{K(i+1)-1}(i+1))$ and $n_j(i+1)$ is the number of trips in $C_u^d[i+1]$ ending in $j$ up until timestep $i+1$.
A similar approach can be used for the conditional distributions with $\lambda_j(i)$. One will then end up with the update rule
$$p(\lambda_j(i+1) \mid C_u^{\{j\}}[i+1], \alpha_j) \propto p(c_u^d(i+1) \mid \hat{\lambda}_j(i))\, p(\hat{\lambda}_j(i) \mid C_u^{\{j\}}[i], \alpha_j),$$
where the notation $C_u^{\{j\}}$ is once again used to denote all the trips starting in $j$. Again, it holds that $p(\lambda_j(i+1) \mid C_u^{\{j\}}[i+1], \alpha_j) \sim \mathrm{Dir}(K(i+1) + 1, \bar{n}_j(i+1) + \alpha_j)$ due to conjugacy, where $\bar{n}_j(i+1) = (n_{j,-1}(i+1), n_{j,0}(i+1), \dots, n_{j,K(i+1)-1}(i+1))$ and $n_{jk}(i+1)$ is the number of transitions from $j$ to $k$ up until timestep $i+1$. Finally, setting $\alpha_{jk} = 0$ would result in the maximum likelihood estimate of the parameters.
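Because of conjugacy, the sequential update reduces to maintaining transition counts, and a changing label set only requires adjusting those counts. The Python sketch below illustrates this under explicit assumptions: a new label starts with zero observed counts, merged labels have their counts summed, and a single symmetric pseudocount `alpha` is used. The class `OnlineDirichletModel` and its method names are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


class OnlineDirichletModel:
    """Count-based sequential update with a label set that can change."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                    # symmetric pseudocount (assumption)
        self.labels = {-1}                    # the outlier label always exists
        self.counts = defaultdict(lambda: defaultdict(int))  # counts[j][k] = n_jk

    def add_label(self, k):
        """New cluster: it enters with zero observed transitions (assumption)."""
        self.labels.add(k)

    def merge_labels(self, merged, into):
        """Merged clusters: their counts are summed into one label (assumption)."""
        for old in merged:
            if old == into:
                continue
            self.labels.discard(old)
            for row in self.counts.values():  # trips that ended in old
                row[into] += row.pop(old, 0)
            if old in self.counts:            # trips that started in old
                for k, n in self.counts.pop(old).items():
                    self.counts[into][k] += n
        self.labels.add(into)

    def update(self, j, k):
        """Observe one transition from source label j to destination label k."""
        self.counts[j][k] += 1

    def posterior(self, j):
        """Event probabilities (n_jk + alpha) / sum over the current labels."""
        row = self.counts[j]
        total = sum(row.get(k, 0) + self.alpha for k in self.labels)
        return {k: (row.get(k, 0) + self.alpha) / total for k in self.labels}
```

Keeping only the counts means that the posterior after trip $i+1$ can be recovered at any time without revisiting earlier trips, which is what makes the update suitable for the online setting.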

Expert model
Another option for the prediction model in the online framework is to use an expert model, as presented in [6]. An expert model requires a set of experts and a reward function. In our case, the action set corresponds to the set of possible destinations and the reward is 1 if an expert makes the correct prediction, and 0 otherwise. More precisely, expert models, or learning with expert advice, is an online learning approach where the rewards in each timestep are known for all available actions. In this section, we adapt this approach to our trip destination problem, where we address several issues such as a dynamic action set.
The authors in [22] present an algorithm called Follow the Awake Leader (FTAL), which considers the expert setting with a dynamic set of available actions at every timestep. It introduces the concept of sleeping experts, which means that the experts are allowed to sleep for some periods of time, i.e. they are not available at those specific time periods. With some modifications we adapt it to our setup. We consider the following assumptions:
1. There is an infinite number of sleeping experts.
2. Once an expert wakes up, it will stay awake.
3. Experts can merge.
The modified algorithm is described in Alg. 4, where the action set $A_i$ would correspond to our cluster space $M_u(i) \cup \{-1\}$ and the different experts are the possible $k \in M_u(i) \cup \{-1\}$ options. For each new trip, the actions played previously are first put in a set $A$ (line 3). If $A$ is empty, then a random expert is played, and otherwise the expert with the highest average reward is selected (line 4-6). The rewards are obtained for all available actions, and the stored parameters are updated (line 9-11). Finally, the function UPDATEACTIONSET($A_{i-1}$) is used to update the set of available actions. However, since $A_i$ would correspond to our state space $M_u(i)$, it is obtained as a result of the clustering. In this setup, each expert suggests performing a single specific action all the time, i.e. expert $k$ would always suggest action $k$.

Algorithm 4 Online expert model
Parameters: $A_i$ is the set of available actions at time $i$, $z_k$ is the cumulative reward for action $k$, $n_k$ is the number of consecutive timesteps action $k$ has been available
1: Initialize $A_0$, and $n_k = 0$, $z_k = 0$ for all $k \in A_0$
2: for $i = 1, \dots, N_u$ do
3: if $A = \emptyset$ then
9: Observe reward $R_k$ for all $k \in A_{i-1}$
10:
11: $n_k \leftarrow n_k + 1$ for all $k \in A_{i-1}$
Similar to the Bayesian model, there will be one expert model that considers all trips, as well as one expert model conditioned on each of the starting locations. In fact, the main difference between these prediction models is the way that a destination is selected. In the Bayesian approach, the average is taken over the total number of trips, whereas in the expert model it is instead taken over the number of trips for which the destination has been available. One advantage that this approach has over the Bayesian approach is that one can experiment with the definition of the rewards without changing the model itself. It is also common for expert models to be accompanied by a regret bound, which is provided in [22] for FTAL. On the other hand, with the Bayesian approach it is possible to define priors if one has access to prior information. There is also an intuitive way to include uncertainty in the predictions.
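The selection rule described for the expert model can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not the FTAL algorithm of [22] itself: the class `OnlineExpertModel`, the seeded random choice, and the `new_actions` argument standing in for UPDATEACTIONSET are hypothetical, and the merging of experts is left out for brevity.

```python
import random


class OnlineExpertModel:
    """Pick the awake expert with the highest average reward so far."""

    def __init__(self, actions=(-1,), seed=0):
        self.n = {k: 0 for k in actions}      # timesteps each action was available
        self.z = {k: 0.0 for k in actions}    # cumulative reward of each action
        self.rng = random.Random(seed)

    def predict(self):
        tried = [k for k in self.n if self.n[k] > 0]
        if not tried:
            # No expert has any history yet: play a random one.
            return self.rng.choice(sorted(self.n))
        return max(tried, key=lambda k: self.z[k] / self.n[k])

    def update(self, destination, new_actions=()):
        """Reward every available expert, then wake up any new experts."""
        for k in self.n:
            self.z[k] += 1.0 if k == destination else 0.0
            self.n[k] += 1
        for k in new_actions:                 # e.g. clusters created by this trip
            self.n.setdefault(k, 0)
            self.z.setdefault(k, 0.0)
```

Since each average is taken over the timesteps for which an expert has been available, a recently discovered destination can overtake an older one after only a few visits, which is the difference to the Bayesian model noted above.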

Regret analysis
A common way to investigate the performance of online learning methods is to look at the regret of the model as a function of the number of trips used for training. Here, we define the regret in comparison to the offline model being trained on the entire trip history, and then evaluated on the very same data. Let $p^*$ be the true discrete distribution conditioned on the source locations, and let $p_i$ be the corresponding predicted distribution at timestep $i$. The squared Hellinger distance [26] between the true and the predicted distribution can then be defined as
$$H^2(p^*, p_i) = \frac{1}{2} \sum_{x \in \mathcal{X}} \left( \sqrt{p^*(x)} - \sqrt{p_i(x)} \right)^2.$$
However, this formulation assumes that both distributions are defined on the same probability space $\mathcal{X}$, which does not necessarily hold in our case.
Let p i be defined on X i and p * on X and assume that there is a surjective function f : i.e. the number of elements in X that maps to f (x) ∈ X i . Then, one way to define the squared Hellinger distance between p * and p i is: where the probability p i (x ) is split equally amongst all p * (x) where f (x) = x . The first sum can be rewritten to yield the following formulation: We consider this formulation from this point forward.
We split the metric into two sub-errors, $H_d^2(p^*, p_i)$ and $H_s^2(p^*, p_i)$, representing the distributional error and the state-space error, respectively. First of all, let us define the distributional error:

$$H_d^2(p^*, p_i) = \frac{1}{2} \sum_{x' \in \mathcal{X}_i} \left( \sqrt{\sum_{x : f(x) = x'} p^*(x)} - \sqrt{p_i(x')} \right)^2,$$

which essentially implies that $p_i(x')$ should be equal to the sum of $p^*(x)$ for all $x \in \mathcal{X}$ such that $f(x) = x'$. The state-space error can then be implicitly defined as

$$H_s^2(p^*, p_i) = H^2(p^*, p_i) - H_d^2(p^*, p_i).$$

Note that this is only properly defined if $H^2(p^*, p_i) \geq H_d^2(p^*, p_i)$, which is not trivially true for the parts of the sum where $f_c(x) > 1$. However, the following theorem shows that this indeed holds.

Proof. The overall squared Hellinger distance for these states is: (1) The distribution error can be written as: Now, we show from Eq. 1 and 2 that $H^2(p, q) \geq H_d^2(p, q)$, i.e. after some simplifications we show that: The left hand side of the last inequality can be rewritten as and the right hand side can be written as Looking only at the first terms in these expressions, one notices that it is a scaled down form of the original problem. Thus, if we can show that we have proven the claim. This inequality can be simplified accordingly: where the last inequality is trivially true. Thus, we conclude that $H^2(p, q) \geq H_d^2(p, q)$.
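For intuition, the inequality can also be checked for a single merged state with the Cauchy-Schwarz inequality. The sketch below is not the derivation used in the proof above. It assumes the squared Hellinger distance is normalized by 1/2 and that $p_i(x')$ is split equally over the $n = f_c(x)$ states $x_1, \dots, x_n$ that map to $x'$. The shorthand $a_j = p^*(x_j)$, $A = \sum_j a_j$ and $q = p_i(x')$ is introduced here only for illustration.

```latex
\begin{align*}
2\,H^2\big|_{x'} &= \sum_{j=1}^{n}\Big(\sqrt{a_j}-\sqrt{q/n}\Big)^2
                  = A + q - 2\sqrt{q/n}\,\sum_{j=1}^{n}\sqrt{a_j},\\
2\,H_d^2\big|_{x'} &= \big(\sqrt{A}-\sqrt{q}\big)^2
                  = A + q - 2\sqrt{A\,q},\\
\sum_{j=1}^{n}\sqrt{a_j} &\le \sqrt{n\sum_{j=1}^{n}a_j} = \sqrt{n\,A}
\;\Longrightarrow\;
\sqrt{q/n}\,\sum_{j=1}^{n}\sqrt{a_j} \le \sqrt{A\,q}
\;\Longrightarrow\;
H^2\big|_{x'} \ge H_d^2\big|_{x'}.
\end{align*}
```

Summing over all $x' \in \mathcal{X}_i$ gives the inequality for the full distance under these assumptions. Equality holds for a merged state exactly when the true mass is spread uniformly over the states that were merged, so in that case the state-space error contributed by that merge vanishes.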
Thus, the Hellinger regret after $n$ trips can finally be defined as

$$\sum_{i=1}^{n} H^2(p^*, p_i),$$

i.e. the cumulative squared Hellinger distance.
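The decomposition into distributional and state-space error can be sketched numerically as follows. This is an illustration under stated assumptions rather than the evaluation code used in the paper: it assumes the 1/2 normalization of the squared Hellinger distance, an equal split of each predicted probability over the true states mapped to it, and distributions stored as plain dictionaries. The function names `hellinger_split` and `hellinger_regret` are hypothetical.

```python
import math
from collections import defaultdict
from itertools import accumulate


def hellinger_split(p_true, p_pred, f):
    """Return (total, distributional, state_space) squared Hellinger errors.

    p_true: dict mapping each state of the full space to its true probability
    p_pred: dict mapping each state of the coarser space to its predicted probability
    f:      dict mapping every state of the full space to a state of the coarser space
    """
    groups = defaultdict(list)
    for x in p_true:
        groups[f[x]].append(x)

    total = 0.0
    dist = 0.0
    for x_pred, members in groups.items():
        q = p_pred.get(x_pred, 0.0)
        share = q / len(members)  # predicted mass split equally over merged states
        total += sum((math.sqrt(p_true[x]) - math.sqrt(share)) ** 2 for x in members)
        mass = sum(p_true[x] for x in members)  # true mass of the merged state
        dist += (math.sqrt(mass) - math.sqrt(q)) ** 2
    total *= 0.5
    dist *= 0.5
    return total, dist, total - dist


def hellinger_regret(p_true, rounds):
    """Cumulative total error over an iterable of (p_pred, f) pairs, one per trip."""
    totals = [hellinger_split(p_true, p_pred, f)[0] for p_pred, f in rounds]
    return list(accumulate(totals))
```

For example, if two true clusters with probabilities 0.5 and 0.3 are merged into one predicted cluster with probability 0.8, the distributional error is zero and the whole error is attributed to the state space, which is the kind of situation the metric is meant to expose.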

Experiments
In this section, we investigate and evaluate the online trip prediction framework, i.e. both the clustering technique and the prediction model, on a real-world dataset of private vehicle trip histories. The clustering is mainly evaluated using well-known clustering metrics, in order to understand how the clusters of the online clustering evolve as more trips are observed. The online prediction models, and the full online framework, are instead evaluated in comparison to the offline pipeline using the accuracy on a held-out test set. Furthermore, they are evaluated using a novel regret metric based on the Hellinger distance, which evaluates the similarity between the predicted and the true distribution of destinations.

Data
In this paper, we use the real-world data collected in [20]. It consists of over 700 GPS-tracked vehicles, or devices, registered either in the county of Västra Götaland or in Kungsbacka municipality. These are located in the south-western part of Sweden and include Gothenburg, which is the second largest city of the country.
A dataset consisting of trips for each of the vehicles is extracted from the original GPS-logs. This was done by defining the end of a trip as the loss of GPS fixation, i.e., when the vehicle has been turned off. In addition, a vehicle speed of less than 0.1 km/h for 10 minutes was also used to signify the end of a trip. Finally, two consecutive trips have been merged if the time between them is less than 10 seconds.
In order to be somewhat consistent with the data processing performed in similar works [4,3,35], we perform additional filtering of the data provided in [20]:
(i) Trips shorter than 100 meters, or shorter than 4 minutes, are discarded.
(ii) Vehicles with a trip history shorter than 30 days, or with a frequency of less than 1 trip per day, are discarded (this guarantees at least 30 distinct trips per user).
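The two filtering rules can be sketched as below. The `Trip` record, the function name `filter_trips` and the order of the two rules (trip filter first, then vehicle filter) are assumptions made for illustration. The length of a trip history is assumed to run from the first trip start to the last trip end.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Trip:
    vehicle_id: str
    start: datetime
    end: datetime
    distance_m: float


def filter_trips(trips, min_distance_m=100.0, min_duration=timedelta(minutes=4),
                 min_history=timedelta(days=30), min_trips_per_day=1.0):
    """Return a dict vehicle_id -> chronologically sorted trips that pass both rules."""
    # Rule (i): drop trips that are too short in distance or in duration.
    kept = [t for t in trips
            if t.distance_m >= min_distance_m and (t.end - t.start) >= min_duration]

    by_vehicle = {}
    for t in kept:
        by_vehicle.setdefault(t.vehicle_id, []).append(t)

    result = {}
    for vehicle_id, vehicle_trips in by_vehicle.items():
        vehicle_trips.sort(key=lambda t: t.start)
        history = vehicle_trips[-1].end - vehicle_trips[0].start
        # Rule (ii): drop vehicles with a short history or too few trips per day.
        if history < min_history:
            continue
        days = history.total_seconds() / 86400.0
        if len(vehicle_trips) / days < min_trips_per_day:
            continue
        result[vehicle_id] = vehicle_trips
    return result
```

With a history of at least 30 days and at least one trip per day, every retained vehicle has at least 30 trips, which matches the guarantee stated in rule (ii).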
After the preprocessing, about 55% of the original trips and 66% of the vehicles remain (74453 out of 134756, and 473 out of 716, respectively). The remaining trip destinations can be seen in Fig. 2, which shows the power-law normalization of the values of a Gaussian kernel density estimation after they have been linearly mapped to the range [0, 1].
We observe that the majority of the trips end in the south-western part of Sweden, i.e., where the vehicles are also registered. Furthermore, in Fig. 3 the average distance and duration of the remaining trips are shown for each of the users. Across all users, the average length of a trip is 15.57 km and the average duration is 17.05 min. In this data, 413 unique private vehicles are studied. Each of these vehicles corresponds to one separate case study, which means that we effectively study 413 different cases. For this type of data this is substantial, since most of the existing public datasets are not private vehicle driving histories, but correspond to taxis, public transport, etc.

Clustering
By clustering the entire dataset with DBSCAN using the parameters $\varepsilon = 100$ m and $m = 2$, one can interpret each cluster as a location that the user has visited at least two times. Fig. 4 illustrates the number of clusters and the percentage of trips ending in a cluster for all the users in the dataset. On average, the number of found clusters, i.e. $K$, is 15.5 and 75.4% of the users' trips end in this set of clusters.
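The interpretation of a cluster as a location visited at least twice can be made concrete with a small sketch. For the special case $m = 2$, and assuming the point itself counts towards $m$ (as with `min_samples` in scikit-learn), every point with a neighbour inside the radius is a core point, so the DBSCAN clusters coincide with the connected components of size two or more in the neighbourhood graph. The code below uses that equivalence. It is not the implementation used in the paper, the function names are hypothetical, and the haversine distance on (latitude, longitude) pairs is an assumed choice of metric.

```python
import math


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(h))


def cluster_min_pts_two(points, eps_m=100.0):
    """Label trip end points as DBSCAN would for m = 2; outliers get label -1."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Join every pair of points that lie within eps_m of each other (O(n^2) pairs).
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if haversine_m(points[i], points[j]) <= eps_m:
                parent[find(i)] = find(j)

    sizes = {}
    for i in range(len(points)):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1

    labels, ids = [], {}
    for i in range(len(points)):
        root = find(i)
        if sizes[root] < 2:
            labels.append(-1)  # visited only once: outlier
        else:
            labels.append(ids.setdefault(root, len(ids)))
    return labels
```

The quadratic pairwise loop is acceptable for a single user's trip history but would need a spatial index for larger inputs.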
The same experiment is performed using the two variants of the online clustering algorithm. Fig. 5 and Fig. 6 show the results for variant 1 and variant 2, respectively. The parameters $m$ and $\varepsilon$ are the same as for the offline clustering. The additional parameter $r$ in variant 1 is set to 1/2, and the function DELETEOLDPOINTS(t) is adjusted to remove the points older than 28 days in both variants. Other combinations of the additional parameters were tested, but the results    ending in a cluster is affected by the number of clusters, but perhaps even more by the fact that the assignment of labels is done in an online way. In other words, the first time a place is visited it cannot yet be labeled as a candidate location and will at that point in time be considered an outlier. This means that the first point to appear where a cluster is going to be formed will never be counted.
Another way to evaluate the online clustering algorithms is to look at the evolution of the Rand score, mutual information, and V-measure score [19,32,29]. All these metrics compare the predicted cluster labels of each trip with the true labels and yield a score that is bounded above by 1, where a score of 1 indicates a perfect match. The true labels in this comparison are those obtained from the offline clustering algorithm when run on the full dataset, i.e. on all available trips for each user.
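As a concrete instance of such a label comparison, the (unadjusted) Rand index can be computed directly from its definition. This is a minimal sketch rather than the evaluation code behind the figures. It assumes that the outlier label -1 is treated like any other label, and the function name `rand_index` is hypothetical.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of trip pairs on which two labelings agree.

    A pair agrees when both labelings put the two trips in the same cluster,
    or both put them in different clusters.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have equal length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

The score depends only on the partition, not on the label names, so relabeling the online clusters does not change it. Library routines such as `rand_score` and `v_measure_score` in `sklearn.metrics` compute comparable scores, although the text does not state which implementation was used.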
In Figs. 7 and 8, one can see the performance using these metrics for the two variants of the online clustering algorithms. In general, we observe that the average of all metrics appears to increase as more trips are processed, as expected. Once again, it is important to emphasize that the cluster labels of the online clustering algorithms are produced in an online manner.
Another interesting aspect is the similarity between the true clusters and those found by the online clustering algorithms after considering the full trip history. This can be done by computing the labels of the online clustering algorithm after it has been trained on the full trip history. The histogram in Fig. 9 shows the difference between the two clustering variants. Using the same metrics, one can see that the clusters obtained from the first variant are more similar to the true ones, since the first variant consistently yields higher scores than the second variant. Thus, in the worst case scenario, the clusters obtained from the first variant appear to be more stable as well. This could partly be attributed to the fact that the first variant has the possibility to shape the clusters as a union of centroids, i.e., they are not necessarily circular.

Evaluation of the entire framework
To investigate the full framework, the first 80% of each user's trip history is used to create a training set, leaving the rest for testing. A clustering algorithm is run to produce $C^s_u$ and $C^d_u$ for the training set, and the transitions are used to estimate the parameters of the distributions.

Offline setting
If the starting location is an outlier, i.e. $c^s_u = -1$, the distribution $p(c^d_u)$ is used to predict the destination by $k^* = \arg\max_{k \in M_u} p(c^d_u = k)$. If $c^s_u$ is not an outlier, i.e. $c^s_u = j$ where $j \in M_u$, the prediction is done as $k^* = \arg\max_{k \in M_u} p(c^d_u = k \mid c^s_u = j)$. In other words, the prediction always corresponds to the cluster with the highest probability, excluding the outliers.
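The prediction rule can be sketched with simple transition counts. The sketch assumes that the probabilities are estimated by relative frequencies, so the argmax over probabilities equals the argmax over counts. Any priors used by the Bayesian model are ignored. The function names, the tie-breaking towards the smallest label and the fallback to the unconditioned distribution for an unseen source are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def fit_counts(transitions):
    """Count destinations overall and per source from (source, destination) labels."""
    marginal = Counter()
    conditional = defaultdict(Counter)
    for src, dst in transitions:
        marginal[dst] += 1
        conditional[src][dst] += 1
    return marginal, conditional


def predict(src, marginal, conditional, outlier=-1):
    """Return the most frequent non-outlier destination, or None if there is none."""
    if src == outlier or src not in conditional:
        counts = marginal  # unconditioned distribution over destinations
    else:
        counts = conditional[src]  # distribution conditioned on the source
    candidates = {k: c for k, c in counts.items() if k != outlier}
    if not candidates:
        return None
    # Ties are broken towards the smallest cluster label (an arbitrary choice).
    return max(sorted(candidates), key=candidates.get)
```

Since the outlier label is never predicted, any trip that ends outside the clusters is counted as a miss, which is why the accuracy over all trips is bounded by the fraction of trips ending in a cluster.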
By evaluating the accuracy of the predictions, we find that out of all trips, the proposed model is able to predict the next destination in 36.15% of the cases on average. Looking only at the trips that end in one of the clusters, i.e. those that can actually be predicted, the accuracy increases to 56.22% on average. The distributions of the accuracy over the different users in both cases are displayed in Fig. 10.

Online setting
We evaluate the online setting, including both the clustering and the prediction model, on the same data as in the offline case, i.e. when testing on the last 20% of each user's trip history. This yields similar results to the offline setting. For the Bayesian model the prediction is always made as the cluster with the highest probability, excluding the outliers. The source, $c^s_u(i)$, is predicted using Alg. 1 with $\delta = 2$. If the source $c^s_u(i) = -1$, i.e. it is an outlier, the distribution $p_i(c^d_u)$ is used to predict $k^* = \arg\max_{k \in M_u(i)} p_i(c^d_u = k)$. Instead, if $c^s_u(i)$ is not an outlier, i.e. $c^s_u(i) = j$ with $j \in M_u(i)$, the prediction is made according to $k^* = \arg\max_{k \in M_u(i)} p_i(c^d_u = k \mid c^s_u = j)$. Similarly, the expert model also excludes outliers in the prediction, by not considering the outliers in the set of available actions when selecting an expert. The expert model corresponding to $p_i(c^d_u)$ is used when $c^s_u(i) = -1$, and the models conditioned on the starting locations are used otherwise.
In Fig. 11, the distribution of the accuracy is shown for both prediction models, as well as the two clustering variants. In general, upon visual inspection the different configurations appear to perform equally well. Furthermore, the accuracy for all trips, as well as for the subset of trips ending in a candidate location, looks similar to the histograms presented for the offline setting. The Bayesian model using the first clustering variant has an average accuracy of 37.51% and 56.05% for all trips and trips ending in candidate locations, respectively. With the second clustering variant, the Bayesian model instead obtains 38.26% and 56.12% for the two cases. Finally, using the expert prediction model yields 37.22% and 55.61% with the first clustering variant, and 38.05% and 55.84% with the second clustering variant. Regardless of prediction configuration, these results are comparable to the offline version, with only minor deviations.

Regret
The regret for several clusters with a large number of trips is shown in Fig. 12 for both the Bayesian model and the expert algorithm using the two clustering variants. The regret is split into the state-space error and the distribution error. For comparison, we consider three baselines:
- Baseline, where we only use the distribution $p(c^d_u)$, i.e. never conditioning on the source of a trip.
- ExpWeights, which is the exponential weights algorithm [6].
- Greedy, which is a greedy variant of the Bayesian model, i.e. we select the most likely option with 100% confidence.
It is worth emphasizing that no established baseline exists for this specific problem setting. The baselines that do exist for trip destination prediction do not consider online learning, or even the creation of the state space (clusters), in their model. Looking at these three examples, we observe that the Bayesian model and the expert algorithm outperform the alternative methods.
We also observe the impact of the online clustering variant on the performance. In the first example (device 298, cluster 2) the second clustering variant helps to decrease the distribution error, without increasing the state-space error. The reason is that this variant assigns most of the early trips to the different clusters, since the clusters generated by this variant cover a larger area. However, this could also be a disadvantage, which is the case for the last example (device 948, cluster 2). In this example, two clusters are merged, which punishes the state-space error heavily. In general, most cases behave similarly to the second example (device 685, cluster 1), where the two clustering variants yield a similar performance.
Lastly, in Fig. 12 we also observe a sublinear behaviour for the complete error in all three examples, which indicates that learning improves over time. This is true for essentially all examples of devices/clusters that have a sufficient trip history. As an example, in Fig. 13

Figure 12: Hellinger regret for the Bayesian method and the expert algorithm, compared to three baselines: non-conditioned distribution, Exponential Weights algorithm, and a Greedy algorithm. Legend: Bayesian, Expert, ExpWeights, Baseline and Greedy, each shown for clustering variants V1 and V2.

Conclusion
In this work, we developed a unified online framework for trip destination prediction consisting of (i) clustering and (ii) a prediction model. The online prediction models are generic and can easily be adapted to other offline or online prediction models in the Bayesian settings that have been studied, e.g., conditioned on additional attributes.
Firstly, we proposed two novel online clustering algorithms and two different online prediction models. The clustering algorithms are online adaptations of the offline DBSCAN method, where the clusters are stored as centroids. The first prediction model is an online adaptation of a Bayesian model conditioned on the starting position, whereas the second is an adaptation of an expert algorithm.
Secondly, we evaluated the online clustering algorithms and the full online framework on a real world trip dataset. The clustering methods were shown to find the most important clusters of the offline solution. We also demonstrated that the full framework yields consistent results with the offline model on unseen data.
Finally, we introduced a new evaluation metric suitable for the online framework. This metric is able to distinguish between distributional error and state-space error, i.e., to distinguish between the clustering and prediction errors.
With sufficient trip histories we were able to show that the proposed methods converge to a probability distribution resembling the true underlying distribution with a lower regret compared to the baselines.
Future work will consist of adding side information to the prediction models, such as time of day, month of year, weather, calendar information, etc. Perhaps the simplest way to accomplish this is to condition the models on the additional attributes. Another future extension is to investigate mixtures of offline and online models and study how the performance improves if we already have an initial set of trips and a model trained on them. Finally, since the clusters represent geographical locations, it would be interesting to investigate a hierarchical variant of DBSCAN or other hierarchical clustering methods [7,8] in order to take the cluster proximities into account.