Abstract
Predictive analytics over mobility data is of great importance since it can assist an analyst to predict events, such as collisions, encounters, traffic jams, etc. A typical example is anticipated location prediction, where the goal is to predict the future location of a moving object, given a lookahead time. What is even more challenging is to be able to accurately predict collective behavioural patterns of movement, such as comovement patterns as well as their course over time. In this paper, we address the problem of Online Prediction of Comovement Patterns. Furthermore, in order to be able to calculate the accuracy of our solution, we propose a comovement pattern similarity measure, which facilitates the comparison between the predicted clusters and the actual ones. Finally, we calculate the clusters’ evolution through time (survive, split, etc.) and compare the cluster evolution predicted by our framework with the actual one. Our experimental study uses two real-world mobility datasets from the maritime and urban domains, respectively, and demonstrates the effectiveness of the proposed framework.
1 Introduction
The vast spread of GPS-enabled devices, such as smartphones, tablets, and GPS trackers, has led to the production of large amounts of mobility-related data. By nature, this kind of data is streaming and there are several application scenarios where the processing needs to take place in an online fashion. These properties have posed new challenges in terms of efficient storage, analytics, and knowledge extraction out of such data. One of these challenges is online cluster analysis, where the goal is to unveil hidden patterns of collective behaviour from streaming trajectories, such as comovement patterns [4, 9, 10, 13, 35]. What is even more challenging is predictive analytics over mobility data, where the goal is to predict the future behaviour of moving objects, which can have a wide range of applications, such as predicting collisions, future encounters, traffic jams, etc. At an individual level, a typical and well-studied example of such analytics is future location prediction [11, 24, 25, 29, 33], where the goal is to predict the future location of a moving object, given a lookahead time. However, prediction of future mobility behaviour at a collective level has not been addressed to the same degree as its individual counterpart.
Concerning the definition of comovement patterns, there are several approaches in the literature, such as [4, 9, 10, 13]. However, most of the above are offline and/or operate at predefined temporal snapshots that imply temporal alignment and uniform sampling, which are not realistic assumptions. For this reason, we adopt the approach presented in [35], which, to the best of our knowledge, is the first online method for the discovery of comovement patterns in mobility data that does not assume temporal alignment and uniform sampling. The goal in [35] is to discover comovement patterns, namely Evolving Clusters, in an online fashion, by employing a graph-based representation. By doing so, the problem of comovement pattern detection is transformed into identifying Maximal Cliques (MCs) (for spherical, Flock-like clusters) or Maximal Connected Subgraphs (MCSs) (for density-connected, Convoy-like clusters).
In particular, the problem that we address in this paper is the Online Prediction of Comovement Patterns. Informally, given a lookahead time interval Δt, the goal is to predict the clusters of moving objects that are anticipated to be shaped after Δt. Figure 1 illustrates such an example, where in blue and green we have the information at hand (past and current locations / clusters, respectively) whereas in orange we have the predictions. The problem we address is quite challenging, since, apart from the inherent difficulty of predicting the future, we also need to define how the error between the actual and the predicted clusters will be measured. This further implies that a predicted cluster should be correctly “matched” with the corresponding actual cluster, which is not a straightforward procedure at all. To the best of our knowledge, the problem we aim to address has not been addressed in the literature yet.
Several mobility-related applications could benefit from such an analytics task. In the urban traffic domain, predicting comovement patterns could assist in detecting future traffic jams, which, in turn, could help the authorities take the appropriate measures (e.g. adjusting traffic lights) in order to minimize them. In the maritime domain, a typical problem is illegal transshipment, where groups of vessels move together “close” enough for some time duration and with low speed. Predicting comovement patterns could therefore help in predicting illegal transshipment events.
Our main contributions are the following:

We provide an efficient solution to the problem of Online Prediction of Comovement Patterns.

We propose a comovement pattern similarity measure, which helps us “match” the predicted with the actual clusters.

We build on MONIC/FINGERPRINT [22, 27], a state-of-the-art cluster evolution discovery method, in order to track the clusters’ evolution through time.

We extensively evaluate the aforementioned framework using large-volume real-world data from two different mobility domains, namely, maritime and urban transportation.
The rest of the paper is organized as follows. Section 2 discusses related work. In Section 3, we formally define the problem at hand. Subsequently, in Section 4 we propose our methodology and we introduce a comovement pattern similarity measure along with the cluster “matching” approach. Section 5 presents our experimental study and, finally, in Section 6 we conclude the paper and discuss future extensions.
2 Related work
The work performed in this paper is closely related to three topics, (a) comovement pattern discovery, (b) future location prediction, and (c) comovement pattern prediction.
Comovement pattern discovery
One of the first approaches for identifying collective mobility behaviour is the so-called flock pattern [17], which identifies groups of at least m objects that move within a disk of radius r for at least k consecutive timepoints. Inspired by this, several related works followed, such as moving clusters [15], convoys [14], swarms [19], platoons [18], traveling companion [32] and gathering pattern [39]. Even though all of these approaches provide explicit definitions of several mined patterns, their main limitation is that they search for specific collective behaviours, defined by respective parameters. An approach that defines a new generalized mobility pattern is presented in [9], where the general comovement pattern (GCMP) is proposed. In [13], the authors propose a frequent comovement pattern (f-CoMP) definition for discovering patterns at multiple spatial scales, also exploiting the overall shape of the objects’ trajectories, while at the same time relaxing the temporal and spatial constraints of the seminal works (i.e. Flocks, Convoys, etc.) in order to discover more interesting patterns. The authors in [4, 10] propose a two-phase online distributed comovement pattern detection framework, which includes the clustering and the pattern enumeration phase, respectively.
Another perspective regarding comovement pattern discovery is to reduce cluster types into graph properties and view them as such. In [35], the authors propose a novel comovement pattern definition, called Evolving Clusters (EC), that unifies the definitions of flocks and convoys and reduces them to Maximal Cliques (MC) and Maximal Connected Subgraphs (MCS), respectively. In addition, the authors propose an online algorithm that discovers several evolving cluster types simultaneously in real time, without assuming temporal alignment, in contrast to the seminal works (i.e., flocks, convoys). In our work, we build on evolving clusters [35] for comovement pattern discovery. The reason why this is the most appropriate choice is that we can predict the course of several pattern types at the same time, without the need to call several other algorithms, which would add redundant computational complexity.
Future location prediction
The fact that the Future Location Prediction (FLP) problem has been extensively studied highlights its importance and applicability in a wide range of applications. Towards tackling the FLP problem, one line of work includes efforts that take advantage of historical movement patterns in order to predict the future location. Such an approach is presented in [33], where the authors propose MyWay, a hybrid, pattern-based approach that utilizes individual patterns when available, and when not, collective ones, in order to provide more accurate predictions and increase the predictive ability of the system. In another effort, the authors in [24, 25] utilize the work done by [31] on distributed subtrajectory clustering in order to be able to extract individual subtrajectory patterns from big mobility data. These patterns are subsequently utilized in order to predict the future location of the moving objects in parallel.
A different way of addressing the FLP problem includes machine learning approaches, such as Recurrent Neural Network (RNN) based models [26] and Gated Recurrent Unit (GRU) [5] models, which constitute the newer generation of RNNs, or static NNs [6, 23, 28]. In this line of research, Suo et al. [29] presented a GRU model to predict vessel trajectories, based on the DBSCAN algorithm to derive main trajectories and a symmetric segmented-path distance approach to eliminate the influence of a large amount of redundant data and to optimize incoming trajectories. Liu et al. [20] proposed a trajectory classifier called Spatio-Temporal GRU to model the spatio-temporal correlations and irregular temporal intervals prevalently present in spatio-temporal trajectories. Particularly, a segmented convolutional weight mechanism was proposed to capture short-term local spatial correlations in trajectories, along with an additional temporal gate to control the information flow related to the temporal interval information. Wang et al. [36] proposed a vessel berthing trajectory prediction model based on bidirectional GRU (Bi-GRU) and cubic spline interpolation. Hao et al. [12] proposed a vehicle trajectory prediction encoder-decoder model based on GRU with an attention mechanism. Particularly, the proposed model is comprised of an intention recognition module and a trajectory prediction module. The former recognizes the driver’s intention and calculates the probabilities of turning left, lane keeping, and turning right, whereas the latter predicts the vehicle trajectory using a GRU decoder with an attention mechanism, which takes the vehicle’s historical positions as input and predicts its future position. Zhang and Zheng [38] proposed a data-driven pedestrian trajectory predictor called MLP-social-GRU. The proposed mechanism first processes a pedestrian trajectory with a Multilayer Perceptron (MLP), and then adopts a GRU to obtain hidden features of the pedestrian’s motion patterns, from which relationships between pedestrians can be simulated.
Comovement pattern prediction
The work most similar to ours is presented in [16]. More specifically, the authors in [16] divide time into time slices of fixed step size and adopt a spherical definition of groups, where each group consists of moving objects that are confined within a radius d, and their goal is to predict the centroid of the groups at the next timeslice. However, the group definition adopted in [16] is rather limited, since it identifies only spherical groups, as opposed to [8] where cylindrical clusters get detected and to [35] where both spherical and density-connected clusters can be identified. In addition, the Group Kalman Filter (GKF) they propose, while novel, tracks only the centroid of each pattern, and not its shape and membership.
3 Problem definition
Before we proceed to the formulation of the problem, let us provide some preliminary definitions.
Definition 1
(Trajectory) A trajectory \(T = \{p_{1}, {\dots } p_{n}\}\) of a moving object is considered as a sequence of timestamped locations, where n corresponds to the latest reported position of T. Further, p_{i} = {x_{i},y_{i},t_{i}}, with 1 ≤ i ≤ n.
Definition 2
(Future Location Prediction - FLP). Given a trajectory T_{i} and a time interval Δt, the goal of FLP is to predict \(p^{i}_{pred} = \{x^{i}_{pred}, y^{i}_{pred}\}\) at timestamp \(t^{i}_{pred} = t^{i}_{now}+{{{\varDelta }}} t\).
Definition 3
(Evolving Cluster - EC) [35]. Given a dataset D of trajectories, a minimum cardinality threshold c, a maximum distance threshold 𝜃, and a minimum time duration threshold d, an EC 〈C,t_{start},t_{end},tp〉 is a subset C ⊆ D of the moving objects’ population, \(\lvert C \rvert \geq c\), which appeared at time point t_{start} and remained alive until time point t_{end} (with t_{end} − t_{start} ≥ d), and during whose lifetime [t_{start},t_{end}] the participating moving objects were spatially connected with respect to distance 𝜃 and cluster type tp (1 for Maximal Cliques, MC, or 2 for Maximal Connected Subgraphs, MCS).
Definition 4
(Comovement Pattern Prediction - CPP). Given a set D of trajectories, a set G of evolving clusters up to timeslice T_{now}, a time interval Δt, and a prediction step s (submultiple of Δt), the goal of CPP is to predict all valid evolving clusters \(G^{\prime }\) up to T_{now} + Δt with step s.
Definition 5
(Cluster Evolution Prediction - CEP). Given a set EC_{curr} of current ECs, a set EC_{pred} of predicted ECs up to a time interval Δt with step s (see Definition 4), a minimum cluster membership overlap threshold τ_{match}, a minimum cluster temporal overlap threshold τ_{temp}, and a minimum cluster split overlap threshold τ_{split}, the goal of CEP is to predict the evolution (emergence, disappearance, etc.) of the discovered clusters up to T_{now} + Δt with step s, in accordance with the aforementioned three thresholds.
In particular, given two clusters C and \(C^{\prime }\) from timeslices T_{i} and T_{i+ 1}, the transitions that a cluster might undergo are as follows:

Survival: C ∈ T_{i} survives into \(C^{\prime } \in T_{i+1}\) iff \(C^{\prime }\) is the match for C and there is no other cluster in T_{i} for which \(C^{\prime }\) is the match.

Absorption: C ∈ T_{i} is absorbed into \(C^{\prime } \in T_{i+1}\) iff \(C^{\prime }\) is the match for C and there is at least one additional cluster in T_{i} for which \(C^{\prime }\) is the match.

Split: C ∈ T_{i} is split into \(\lbrace C_{1}^{\prime }, C_{2}^{\prime }, \dots , C_{n}^{\prime }\rbrace \in T_{i+1}, n > 1\) iff the overlap of C to each of the clusters \(C_{i}^{\prime }\) is greater than a threshold τ_{split} and the overlap of all the clusters \(C_{i}^{\prime }\) together with C is greater than a threshold τ_{match}.

Disappearance: C ∈ T_{i} disappears at T_{i+ 1} when C is not in the set of discovered clusters at T_{i+ 1}.

Emergence: C ∈ T_{i+ 1} emerges at T_{i+ 1} when C is not in the set of discovered clusters at T_{i}.
Figure 1 provides an illustration of Definition 4. More specifically, we know the movement of nine objects from T_{1} until T_{3} and (via EvolvingClusters with c = 3 and d = 2) the four evolving clusters that are formed: P_{1} = 〈{a,b,c,d,e},T_{1},T_{3},2〉, P_{2} = 〈{a,b,c},T_{1},T_{3},1〉, P_{3} = 〈{b,c,d,e},T_{1},T_{3},1〉, P_{4} = 〈{g,h,i},T_{1},T_{3},1〉, where tp = 1(2) corresponds to MC (MCS, respectively). Our goal is to predict the anticipated evolution of ECs until T_{5}, i.e., P_{1},P_{2},P_{3},P_{4} will continue to exist and two new patterns P_{5} = 〈{f,g,h,i},T_{4},T_{5},1〉, and P_{6} = 〈{c,d,e},T_{1},T_{5},1〉 will emerge.
4 Methodology
In this section, we present the proposed methodology for addressing the CPP and CEP problems, as defined in Definitions 4 and 5, respectively. The proposed framework follows the popular lambda architecture, designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods [21]. This architectural paradigm attempts to balance latency, throughput, and fault tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. Figure 2 illustrates the architecture of our proposed methodology, consisting of an offline and an online layer. More specifically, at the offline layer we train our FLP-offline model, given a historical trajectory dataset, whereas at the online layer we receive the streaming GPS locations, predict the objects’ next locations (FLP-online module), and discover ECs at each timeslice. Finally, we compare the current with the predicted ECs, and output our prediction regarding the evolving clusters’ evolution [22, 27] over time.
In the subsections that follow, we give more insights regarding the modules comprising our framework in Fig. 2. More specifically, in Sections 4.1 and 4.2 we describe two alternative models that can be used for FLP, while in Sections 4.3 and 4.4, we describe the modules in charge of EC discovery and cluster evolution, respectively.
4.1 GRU-based future location prediction
The first alternative to address the FLP problem (called GRU-FLP) employs a GRU-based method, utilizing the works done in [7, 30]. Subsequently, we briefly state the update rules for the employed GRU layer [5, 37].
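Assuming the standard GRU formulation of [5], the update rules at step k can be written as follows; the step index k, the element-wise product \(\odot\), and the double subscripts on the weight matrices are notation introduced here for illustration:

\[ \mathbf{z}_{k} = \sigma \left( \mathbf{W}_{pz} \tilde{\mathbf{p}}_{k} + \mathbf{W}_{hz} \mathbf{h}_{k-1} + \mathbf{b}_{z} \right) \]
\[ \mathbf{r}_{k} = \sigma \left( \mathbf{W}_{pr} \tilde{\mathbf{p}}_{k} + \mathbf{W}_{hr} \mathbf{h}_{k-1} + \mathbf{b}_{r} \right) \]
\[ \tilde{\mathbf{h}}_{k} = \tanh \left( \mathbf{W}_{ph} \tilde{\mathbf{p}}_{k} + \mathbf{W}_{hh} \left( \mathbf{r}_{k} \odot \mathbf{h}_{k-1} \right) + \mathbf{b}_{h} \right) \]
\[ \mathbf{h}_{k} = \mathbf{z}_{k} \odot \mathbf{h}_{k-1} + \left( 1 - \mathbf{z}_{k} \right) \odot \tilde{\mathbf{h}}_{k} \]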
where z and r represent the update and reset gates, respectively, \(\tilde {\mathbf {h}}\) and h represent the intermediate memory and output, respectively. Also, in these equations, the W_{∗} variables are the weight matrices and the b_{∗} variables are the biases. Moreover, \(\tilde {\mathbf {p}}\) represents the input, which is composed of the differences in space (Δx, Δy), the difference in time (Δt), and the time horizon of the prediction (Δt_{next}).
Based on [7, 30], our GRU-based NN architecture (GRU-FLP) is composed of the following layers: a) an input layer of four neurons, one for each input variable, b) a single GRU hidden layer composed of 150 neurons, c) a fully-connected hidden layer composed of 50 neurons, and d) an output layer of two neurons, one for each prediction coordinate (Δx_{next}, Δy_{next}); a schematic overview of the proposed network architecture is presented in Fig. 3. Given the two most recent locations p_{i− 1},p_{i} of an object’s trajectory, we predict Δx_{next} and Δy_{next}, which afterwards are summed with p_{i}, thus producing the predicted location of the aforementioned object.
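The input and output handling around the network can be sketched as follows. This is a minimal illustration and not the authors' implementation: the names `Point`, `gru_flp_input` and `apply_displacement` are hypothetical, and `constant_velocity_model` is a stand-in for the trained network, used only so that the example runs.

```python
from typing import NamedTuple, Tuple


class Point(NamedTuple):
    x: float
    y: float
    t: float  # timestamp, in seconds


def gru_flp_input(prev: Point, curr: Point, horizon: float) -> Tuple[float, float, float, float]:
    """Build the four input variables: (dx, dy, dt, dt_next)."""
    return (curr.x - prev.x, curr.y - prev.y, curr.t - prev.t, horizon)


def apply_displacement(curr: Point, dx_next: float, dy_next: float, horizon: float) -> Point:
    """Sum the predicted displacement with the latest location p_i."""
    return Point(curr.x + dx_next, curr.y + dy_next, curr.t + horizon)


def constant_velocity_model(features: Tuple[float, float, float, float]) -> Tuple[float, float]:
    """Hypothetical stand-in for the trained network: extrapolate the last displacement."""
    dx, dy, dt, horizon = features
    scale = horizon / dt
    return dx * scale, dy * scale


prev, curr = Point(0.0, 0.0, 0.0), Point(10.0, 5.0, 60.0)
features = gru_flp_input(prev, curr, horizon=120.0)
dx_next, dy_next = constant_velocity_model(features)
predicted = apply_displacement(curr, dx_next, dy_next, horizon=120.0)
print(predicted)
```

Predicting displacements rather than absolute coordinates keeps the inputs and outputs in a small, location-independent range, which is presumably why the network is fed differences.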
4.2 Subtrajectory based future location prediction
The second alternative to address the FLP problem (called ST-FLP) employs historical mobility patterns by utilizing the work done in [25]. In more detail, the problem is divided into an offline phase, which is responsible for identifying patterns of movement, and an online phase, which is responsible for predicting the future location of a moving object, given a lookahead time and the set of patterns identified during the offline phase, as depicted in Fig. 4.
In more technical depth, the offline component consists of a distributed storage file system, which contains accumulated historical mobility data, and the Distributed Subtrajectory Pattern Extraction module, which takes as input a distributed trajectory dataset from the distributed file system and identifies a set of subtrajectory patterns. A predictor is built for each moving object by taking into account its individual past movement. The online component receives as input streams of mobility data concerning the recent positions of moving objects. For each moving object, the corresponding subtrajectory patterns are retrieved, its k most recent positions are matched with the most similar pattern, and the object’s future location up to the given lookahead time Δt is predicted.
4.3 Evolving clusters discovery
After receiving the predicted locations for each moving object, we use EvolvingClusters [35] in order to shape the predicted comovement patterns. Because the sampling rate may vary for each moving object, we use linear interpolation to temporally align the predicted locations at a common timeslice with a stable sampling (alignment) rate sr.
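The temporal alignment step can be sketched as below, assuming planar coordinates and per-object samples sorted by time. The function name `align` and the `(t, x, y)` tuple layout are illustrative choices, not part of EvolvingClusters.

```python
from bisect import bisect_left
from typing import List, Tuple

Sample = Tuple[float, float, float]  # (t, x, y)


def align(samples: List[Sample], sr: float) -> List[Sample]:
    """Resample one object's time-sorted locations at multiples of sr via linear interpolation."""
    if len(samples) < 2:
        return []
    times = [s[0] for s in samples]
    # First multiple of sr at or after the first sample (ceiling division).
    t = -(-times[0] // sr) * sr
    aligned = []
    while t <= times[-1]:
        j = bisect_left(times, t)
        if times[j] == t:
            # A sample already falls exactly on the timeslice.
            aligned.append((t, samples[j][1], samples[j][2]))
        else:
            # Interpolate between the two samples that surround t.
            t0, x0, y0 = samples[j - 1]
            t1, x1, y1 = samples[j]
            w = (t - t0) / (t1 - t0)
            aligned.append((t, x0 + w * (x1 - x0), y0 + w * (y1 - y0)))
        t += sr
    return aligned


track = [(10.0, 0.0, 0.0), (70.0, 60.0, 30.0), (130.0, 60.0, 90.0)]
aligned = align(track, sr=60.0)
print(aligned)
```

Only timeslices inside the observed time span are produced, so the sketch never extrapolates beyond the first or last sample.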
In a nutshell, given a timeslice T_{now}, EvolvingClusters works as follows:

Calculates the pairwise distance between the objects within T_{now}, and drops the pairs with distance greater than 𝜃;

Creates a graph based on the remaining pairs, and extracts its Maximal Connected Subgraphs (MCS) and Maximal Cliques (MC) with respect to c;

Maintains the currently active (and inactive) clusters, given the MCS and MC of T_{now} and the recent (active) pattern history; and

Outputs the eligible active patterns with respect to c, d and 𝜃.
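The per-timeslice part of these steps (distance filtering and graph extraction) can be sketched as follows. This is an illustrative sketch rather than the EvolvingClusters code: it assumes planar Euclidean coordinates, links two objects when they lie within 𝜃 of each other, uses a plain Bron-Kerbosch enumeration for the cliques, and leaves out the maintenance of pattern history across timeslices. All function names are hypothetical.

```python
from itertools import combinations
from math import hypot
from typing import Dict, FrozenSet, List, Set, Tuple


def proximity_graph(locations: Dict[str, Tuple[float, float]], theta: float) -> Dict[str, Set[str]]:
    """Link every pair of objects whose distance is at most theta."""
    graph: Dict[str, Set[str]] = {oid: set() for oid in locations}
    for a, b in combinations(locations, 2):
        (xa, ya), (xb, yb) = locations[a], locations[b]
        if hypot(xa - xb, ya - yb) <= theta:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph: Dict[str, Set[str]], c: int) -> List[FrozenSet[str]]:
    """Maximal connected subgraphs (MCS) with at least c members."""
    seen: Set[str] = set()
    result = []
    for start in graph:
        if start in seen:
            continue
        component: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(graph[node] - component)
        seen |= component
        if len(component) >= c:
            result.append(frozenset(component))
    return result


def maximal_cliques(graph: Dict[str, Set[str]], c: int) -> List[FrozenSet[str]]:
    """Maximal cliques (MC) with at least c members, via basic Bron-Kerbosch."""
    cliques = []

    def expand(r: Set[str], p: Set[str], x: Set[str]) -> None:
        if not p and not x:
            if len(r) >= c:
                cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


locations = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (0.5, 0.8), "d": (2.0, 0.0), "e": (10.0, 10.0)}
graph = proximity_graph(locations, theta=1.5)
mcs = connected_components(graph, c=3)
mc = maximal_cliques(graph, c=3)
print(mcs, mc)
```

In this toy timeslice, object d is chained to the group only through b, so it belongs to the density-connected cluster but not to the clique.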
The output of EvolvingClusters, and by extension of the whole predictive model, is a tuple of four elements (o_{ids}, t_{s}, t_{e}, tp), where o_{ids} is the set of objects that form an evolving cluster, t_{s} and t_{e} are its start and end timestamps, respectively, and tp denotes its type.
For instance, the final output of the model for the example of Fig. 1 is a set of 4element tuples, i.e., {(P_{1},T_{1},T_{5},2),(P_{2},T_{1},T_{5},1), \((P_{3}, T_{1}, T_{4}, 1), (P_{3}, T_{1}, T_{5}, 2), (P_{4}, T_{1}, T_{5}, 1) \rbrace \bigcup \lbrace (P_{5}, T_{4}, T_{5}, 1), (P_{6}, T_{1}, T_{5}, 1) \rbrace\). In other words, we predict that:

P_{1},P_{2},P_{3},P_{4} will survive intact,

P_{3} will become inactive at timeslice T_{4}, but it will remain active as an MCS at timeslice T_{5}, and,

two new clusters P_{5},P_{6} will be discovered at timeslice T_{5}
4.4 Cluster evolution
While the prediction of the course of a cluster is an important task, of equal importance is the tracking of the clusters’ transitions, so as to derive some more salient conclusions regarding the clusters’ movement.
In order to track the clusters’ evolution through time, we use the MONIC/FINGERPRINT framework [22, 27], which proposes a taxonomy of cluster transitions, differentiating between external and internal transitions, that concern the relationship of a cluster to other clusters, and itself, respectively. In contrast to other graph-based methods such as the Reeb graph [3] that are intended (mainly) for offline processing, MONIC can be used within online workflows, therefore rendering it a good option for predicting the clusters’ evolution through time in a real-time fashion.
As already discussed in Section 3, given two clusters C and C^{′} from timeslices T_{i} and T_{i+ 1}, respectively, the transitions that a cluster might undergo are among the following: survival, absorption, split, disappearance, and emergence.
To measure the similarity between C and \(C^{\prime }\), the authors in [22, 27] use the following formula, which in a nutshell, is a variant of the well-known Jaccard index, a metric which we will use extensively in the following sections.
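In its unweighted form, that is, assuming every cluster member carries the same weight, the overlap of C to \(C^{\prime }\) used by MONIC can be written as:

\[ overlap(C, C^{\prime }) = \frac{\lvert C \cap C^{\prime } \rvert }{\lvert C \rvert } \]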
In our running example (c.f. Fig. 1), the expected output of MONIC (for the discovered MCS) is that clusters P_{1}, and P_{4} emerge at T_{1} and survive up to T_{3}. From that point, after we use our FLP model, we predict that cluster P_{1} will survive up to T_{5}, cluster P_{6} will emerge at T_{5}, and cluster P_{4} will be absorbed by cluster P_{5} at T_{4}, which in turn will survive up to T_{5}.
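The transition rules of Section 3 can be sketched as below. This is a simplified illustration under stated assumptions, not the MONIC implementation: clusters are plain member sets, the overlap is the unweighted fraction of a cluster's members found in its successor, a cluster's match is its highest-overlap successor provided the overlap reaches τ_{match}, and the threshold values in the example are arbitrary.

```python
from typing import Dict, FrozenSet, List, Tuple

Clusters = Dict[str, FrozenSet[str]]


def overlap(c: FrozenSet[str], c_next: FrozenSet[str]) -> float:
    """Fraction of the members of c that are also in c_next (unweighted)."""
    return len(c & c_next) / len(c)


def track_transitions(
    prev: Clusters, curr: Clusters, tau_match: float, tau_split: float
) -> Tuple[Dict[str, Tuple[str, List[str]]], List[str]]:
    """Label each cluster of T_i and list the clusters that emerge at T_{i+1}."""
    matched: Dict[str, str] = {}
    outcome: Dict[str, Tuple[str, List[str]]] = {}
    for name, members in prev.items():
        scores = {n: overlap(members, m) for n, m in curr.items()}
        best = max(scores, key=scores.get, default=None)
        if best is not None and scores[best] >= tau_match:
            matched[name] = best
            continue
        # No match: check whether the cluster was split into several parts.
        parts = sorted(n for n, s in scores.items() if s >= tau_split)
        union = frozenset().union(*(curr[n] for n in parts))
        if len(parts) > 1 and overlap(members, union) >= tau_match:
            outcome[name] = ("split", parts)
        else:
            outcome[name] = ("disappearance", [])
    for name, target in matched.items():
        # Absorption when another cluster of T_i has the same match.
        sharing = sum(1 for t in matched.values() if t == target)
        outcome[name] = ("absorption" if sharing > 1 else "survival", [target])
    explained = {t for _, targets in outcome.values() for t in targets}
    emerged = sorted(n for n in curr if n not in explained)
    return outcome, emerged


prev = {
    "A": frozenset({"1", "2", "3"}),
    "B": frozenset({"4", "5", "6"}),
    "C": frozenset({"7", "8", "9", "10"}),
    "D": frozenset({"11", "12", "13"}),
    "E": frozenset({"30", "31", "32"}),
}
curr = {
    "X": frozenset({"1", "2", "3", "4", "5", "6"}),
    "Y": frozenset({"7", "8"}),
    "Z": frozenset({"9", "10"}),
    "W": frozenset({"20", "21", "22"}),
    "V": frozenset({"30", "31", "32", "33"}),
}
outcome, emerged = track_transitions(prev, curr, tau_match=0.6, tau_split=0.25)
print(outcome, emerged)
```

Under these rules a cluster that merely gains members (E into V) counts as a survival; absorption requires at least two clusters of T_{i} to share the same match.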
4.5 Evaluation measures
The evaluation of a comovement pattern prediction approach is not a straightforward task, since we need to define how the error between the predicted and the actual clusters will be quantified. Intuitively, our aim is to match each predicted cluster C_{pred} with the most similar actual cluster C_{act}. Towards this direction, we adopt, and then combine, three similarity measures, namely, spatial, temporal, and membership similarity. Concerning the spatial similarity, it is defined as follows:
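Assuming an intersection-over-union form, in line with the Jaccard similarity adopted for cluster membership (the area operator is notation introduced here), the spatial similarity can be written as:

\[ sim^{spatial}(C_{pred}, C_{act}) = \frac{area\left( MBR(C_{pred}) \cap MBR(C_{act}) \right) }{area\left( MBR(C_{pred}) \cup MBR(C_{act}) \right) } \]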
where MBR(C_{pred}), (MBR(C_{act})) is the spatial coverage of the predicted cluster (actual cluster, respectively), in other words, the Minimum Bounding Rectangle (MBR) covering the objects’ locations in the given timeslice.
Regarding the temporal dimension, we adopt Allen’s interval algebra [2] and calculate the temporal similarity as follows:
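Assuming the same intersection-over-union form applied to interval lengths, where \(\lvert \cdot \rvert\) denotes the duration of an interval, the temporal similarity can be written as:

\[ sim^{temp}(C_{pred}, C_{act}) = \frac{\lvert Interval(C_{pred}) \cap Interval(C_{act}) \rvert }{\lvert Interval(C_{pred}) \cup Interval(C_{act}) \rvert } \]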
where Interval(C_{pred}), (Interval(C_{act})) is the time interval when the predicted cluster (actual cluster, respectively) was valid.
For the membership similarity, we adopt the Jaccard similarity:
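Written out for a predicted and an actual cluster, with each cluster treated as the set of its member objects, the Jaccard similarity reads:

\[ sim^{member}(C_{pred}, C_{act}) = \frac{\lvert C_{pred} \cap C_{act} \rvert }{\lvert C_{pred} \cup C_{act} \rvert } \]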
Finally, we define the overall similarity between two clusters as:
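Following the convex combination of spatial, temporal, and membership similarity described in Section 5.2.4, and using \(sim^{*}\) as our own shorthand for the combined score, this can be written as:

\[ sim^{*}(C_{pred}, C_{act}) = \lambda _{1} \cdot sim^{spatial}(C_{pred}, C_{act}) + \lambda _{2} \cdot sim^{temp}(C_{pred}, C_{act}) + \lambda _{3} \cdot sim^{member}(C_{pred}, C_{act}) \]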
where \(\lambda _{1} + \lambda _{2} + \lambda _{3} = 1,\ \lambda _{i} \in \left (0, 1\right ),\ i \in \lbrace 1, 2, 3 \rbrace\).
This further implies that a predicted cluster should be correctly matched with the corresponding actual cluster, which is not a straightforward procedure. Our methodology for matching each predicted cluster C_{pred} with the corresponding actual one C_{act} is depicted in (9).
In other words, we compare each C_{pred} with every C_{act} and “match” it with the most similar one. After all predicted clusters get traversed, we end up with the set Sim(EC_{pred},EC_{act}) which holds all the “matchings”.
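The matching procedure can be sketched as follows. The sketch assumes intersection-over-union forms for the spatial and temporal measures, axis-aligned MBRs given as (x_min, y_min, x_max, y_max), and equal weights λ_{i} = 1/3; the `Cluster` type and all function names are hypothetical.

```python
from typing import Dict, FrozenSet, NamedTuple, Tuple


class Cluster(NamedTuple):
    members: FrozenSet[str]
    mbr: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
    interval: Tuple[float, float]           # (t_start, t_end)


def sim_spatial(a: Cluster, b: Cluster) -> float:
    ax0, ay0, ax1, ay1 = a.mbr
    bx0, by0, bx1, by1 = b.mbr
    width = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    height = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = width * height
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def sim_temp(a: Cluster, b: Cluster) -> float:
    inter = max(0.0, min(a.interval[1], b.interval[1]) - max(a.interval[0], b.interval[0]))
    union = (a.interval[1] - a.interval[0]) + (b.interval[1] - b.interval[0]) - inter
    return inter / union if union > 0 else 0.0


def sim_member(a: Cluster, b: Cluster) -> float:
    return len(a.members & b.members) / len(a.members | b.members)


def similarity(a: Cluster, b: Cluster, lambdas=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Convex combination of spatial, temporal and membership similarity."""
    l1, l2, l3 = lambdas
    return l1 * sim_spatial(a, b) + l2 * sim_temp(a, b) + l3 * sim_member(a, b)


def match(predicted: Dict[str, Cluster], actual: Dict[str, Cluster]) -> Dict[str, Tuple[str, float]]:
    """Pair every predicted cluster with its most similar actual cluster."""
    matches = {}
    for name, c_pred in predicted.items():
        best = max(actual, key=lambda n: similarity(c_pred, actual[n]))
        matches[name] = (best, similarity(c_pred, actual[best]))
    return matches


predicted = {"P": Cluster(frozenset("abc"), (0.0, 0.0, 2.0, 2.0), (0.0, 4.0))}
actual = {
    "A1": Cluster(frozenset("abcd"), (1.0, 0.0, 3.0, 2.0), (0.0, 4.0)),
    "A2": Cluster(frozenset("xyz"), (10.0, 10.0, 12.0, 12.0), (0.0, 4.0)),
}
matches = match(predicted, actual)
print(matches)
```

Here P is paired with A1: the two share three of four members and half of each MBR, whereas A2 agrees with P only in time.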
5 Experimental study
In this section, we evaluate our predictive model on two different mobility domains, namely maritime and urban transportation, and present our experimental results.
5.1 Datasets & preprocessing
From the maritime domain, we use the “Piraeus”^{Footnote 1} dataset [34], which consists of 103 million AIS positioning messages from 5,163 vessels (passenger boats, fisheries, cargo, containers, etc) in the wider region of Piraeus, Greece. The dataset ranges in time and space, as follows:

temporal range: January 1^{st}, 2018 – December 31^{st}, 2018 (1 year);

spatial range: longitude in [22.992, 24.031]; latitude in [37.437, 38.046].
During the preprocessing stage, we drop erroneous records (i.e. GPS locations) based on a speed threshold speed_{max}, as well as stop points (by removing records corresponding to speed less than 1 knot [7]). Although it may sound optional, the latter is a critical cleansing process in order for the ML model to avoid considering non-evolving parts of the trajectories. Afterwards, we organize the cleansed data into trajectories based on the temporal interval between two consecutive signals of the same vessel, given a threshold dt. Finally, in order to discover evolving clusters, we need a stable and temporally aligned sampling rate. For the aforementioned dataset, we set the following thresholds: speed_{max} = 50 knots, dt = 30 min., and rate = 1 min. The rationale behind these thresholds stems from the characteristics of the dataset, which were unveiled after a statistical analysis of the distribution of the speed and dt between consecutive points of the same trajectory.
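The cleansing and segmentation steps can be sketched as follows, assuming that each record already carries a speed value in knots and a timestamp in seconds. The `Record` type and the function names are hypothetical; the thresholds are the ones stated above.

```python
from typing import Dict, List, NamedTuple


class Record(NamedTuple):
    vessel_id: str
    t: float      # timestamp, in seconds
    speed: float  # speed, in knots


def clean(records: List[Record], speed_max: float = 50.0, speed_min: float = 1.0) -> List[Record]:
    """Drop erroneous records (speed above speed_max) and stop points (speed below speed_min)."""
    return [r for r in records if speed_min <= r.speed <= speed_max]


def segment(records: List[Record], dt: float = 30 * 60) -> List[List[Record]]:
    """Start a new trajectory whenever two consecutive signals of a vessel are more than dt apart."""
    trajectories: List[List[Record]] = []
    open_trajectory: Dict[str, List[Record]] = {}
    for r in sorted(records, key=lambda rec: (rec.vessel_id, rec.t)):
        current = open_trajectory.get(r.vessel_id)
        if current is None or r.t - current[-1].t > dt:
            current = []
            trajectories.append(current)
            open_trajectory[r.vessel_id] = current
        current.append(r)
    return trajectories


raw = [
    Record("v1", 0.0, 10.0),
    Record("v1", 600.0, 0.5),    # stop point, dropped
    Record("v1", 1200.0, 12.0),
    Record("v1", 4000.0, 11.0),  # gap above 30 min, starts a new trajectory
    Record("v1", 4100.0, 80.0),  # erroneous speed, dropped
]
trajectories = segment(clean(raw))
print([len(t) for t in trajectories])
```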
From the urban domain, we use the popular “GeoLife”^{Footnote 2} dataset [40,41,42], which contains information regarding urban movement (mainly) in Beijing, China. In particular, it consists of approximately 23.5 million records from 181 users organized in 18,454 trips, with some of their trips being semantically annotated with respect to the means of transportation they used at that time. The dataset ranges in time and space as follows:

temporal range: April 12^{th}, 2007 – July 27^{th}, 2012 (approx. 5 years);

spatial range: longitude in [73.50, 134.78], latitude in [17.96, 53.56] (restricted to China).
Regarding the GeoLife dataset, we omit the preprocessing step and accept it ‘as-is’. The main reason for this is that the raw points are already organized into trips and are neatly labeled with respect to the vehicle used. That being said, the only preprocessing we performed was to restrict its spatial range to the China region.
A map visualization of (parts of) the two datasets is illustrated in Fig. 5. In particular, Fig. 5 (left) visualizes Piraeus’ locations on July 10^{th}, 2018, while Fig. 5 (right) visualizes GeoLife locations on October 2022^{nd}, 2011.
5.2 Experimental setup and results
The objectives of our experimental study are related to:

the efficiency of our framework in terms of run time;

FLP prediction quality;

its effect on EC prediction quality;

the sensitivity of (8) with respect to λ_{i}; and

quality assessment of the clusters’ predicted evolution.
The GRU-FLP and EC algorithms were implemented in Python3 (via Anaconda virtual environments), while the ST-FLP algorithm was implemented in Java. The experiments were conducted using Apache Kafka with 1 topic for the transmitted (loaded from a CSV file) and predicted locations, as well as 1 consumer for FLP and EC, respectively.
For the EC-related experiments, we utilized a single node with 8 CPU cores, 16 GB of RAM and 256 GB of HDD. For ST-FLP, we employed a 49-node Hadoop 2.7.2 cluster, where the master node consists of 8 CPU cores, 8 GB of RAM and 60 GB of HDD, while each slave node is comprised of 4 CPU cores, 4 GB of RAM and 60 GB of HDD. The offline component was implemented over Apache Hadoop and the online component by utilizing Apache Kafka for messaging and the Kafka Consumers interface for stream processing. All of the above are provided by okeanos-knossos,^{Footnote 3} an IaaS service for the Greek Research and Academic Community. Finally, for the GRU-FLP, we used a single node equipped with 16 CPU cores, 64 GB of RAM, 1 TB of HDD, and an Nvidia GTX 2080Ti.
In the sections that follow, we provide the experimental results of our study, using the default parameters of EvolvingClusters per dataset, as defined by [35], as Table 1 illustrates. The respective source code is available at https://github.com/DataStoriesUniPi/EvolvingClusters, as a branch of the general Evolving Clusters related library of operations.
5.2.1 Efficiency of our framework towards online processing
It has been shown [7, 30, 35] that the EC, GRU-FLP, and ST-FLP algorithms can each run in online mode in the real-world settings on which they were evaluated, with a response time of ≈ 250 ms for predicting the vessels’ locations at a single temporal instance. However, when combined, the time complexity of the proposed framework is dominated by the most ‘expensive’ steps, namely EC and CE.
Nevertheless, according to our experiments, our framework is able to run in online mode in all experimental realworld scenarios. In detail, focusing on the entire temporal horizon of both datasets, and running our framework with an activation step of 60 s, Fig. 6 illustrates the (average recorded) response time for its most “expensive” steps, namely, EC and CE.
As it appears in the chart, the response time for EC (c.f. Fig. 6a) remains below 400 ms for thousands of concurrent objects and increases up to (approx.) 1400 ms for 3,527 objects, which corresponds to the most crowded timeslice in our experimental study. Similar results are also recorded for CE, albeit with slightly increased response, with up to 1000 ms when considering ≈ 128 clusters, and up to 2500 ms for ≈ 256 clusters.
Summarising, although it takes some time to train the underlying FLP models, the response time of our framework in its online part is typically below 4 sec per monitored timeslice, thus placing it well within the family of online methods. Last but not least, the performance of our framework naturally depends on the hardware of the experimental setup.
5.2.2 FLP prediction quality
In our method, predicting evolving clusters relies heavily on the FLP model. In this context, we compare two models, namely GRU-FLP (c.f. Section 4.1) and ST-FLP (c.f. Section 4.2), and assess their impact on EC discovery. Figure 7 illustrates the distribution of the predicted locations’ Mean Absolute Error (MAE) versus the prediction lookahead Δt. We observe that as we increase Δt, the error of GRU-FLP increases rapidly, going from hundreds of meters (Δt = 1) to tens of kilometers (Δt = 10). On the other hand, for ST-FLP the error remains at most 1,000 meters on the Piraeus and GeoLife datasets, with slight increases as Δt grows, thus rendering ST-FLP the better model for the datasets and task at hand.
5.2.3 EC prediction quality
Having evaluated the quality of our predictive models, we proceed to assess the quality of the predicted evolving clusters for our models on both mobility domains with respect to their corresponding “ground truth”. As “ground truth” we define the evolving clusters discovered using the objects’ actual positions, and for quantifying the accuracy of a predicted cluster C_{pred} against an actual cluster C_{act} we use (9), with λ_{i}, i ∈ [1,3] set to their corresponding values. Without loss of generality, Fig. 8 illustrates the EC prediction quality of both predictive models over the Piraeus dataset, where solid lines correspond to the median accuracy and the confidence intervals to the 25^{th} and 75^{th} percentiles, respectively.
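As an illustration of how such a median line and its 25^{th}/75^{th} percentile band can be derived, the sketch below summarises a set of per-cluster accuracy scores for each lookahead. The function name and the input layout (a dict from lookahead to a list of scores) are hypothetical, and the "inclusive" quantile method is merely one reasonable choice.

```python
from statistics import quantiles


def accuracy_band(scores_by_lookahead):
    """Map each lookahead to (25th percentile, median, 75th percentile).

    scores_by_lookahead: dict {lookahead: [similarity scores in [0, 1]]};
    each list needs at least two scores.
    """
    band = {}
    for lookahead, scores in scores_by_lookahead.items():
        # n=4 returns the three cut points that split the data into quartiles.
        q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
        band[lookahead] = (q1, q2, q3)
    return band
```

A wide gap between the first and the third value for a given lookahead indicates that the accuracy varies strongly across the predicted clusters.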
Observing the aforementioned figure, we deduce that, as the lookahead (Δt) increases, STFLP consistently outperforms GRUFLP (with median accuracy close to 70%) in all aspects of cluster matching except sim^{member}, where GRUFLP outperforms STFLP by a significant margin. More specifically, the performance of STFLP on sim^{temp} and sim^{spatial} shows similar behaviour, decreasing at a similar rate as Δt increases, a behaviour which is seemingly correlated with the FLP models’ performance.
Because both datasets are, in general, quite challenging, as they exhibit a high degree of unpredictability (due to the number of manoeuvres in each trajectory), we observe that both the GRUFLP and STFLP models have quite wide confidence intervals, indicating that there is room for further tuning.
5.2.4 EC similarity score sensitivity
One of the main contributions of our work is the proposed EC similarity score (cf. (8)) for matching the predicted ECs with the actual ones. This measure takes the form of a convex combination (weights \(\lambda _{1}, \lambda _{2}, \lambda _{3}\), with \({\sum }_{i} \lambda _{i} = 1\)) of spatial, temporal, and membership similarities, respectively. The values of λ_{i} are fixed to their default values (i.e., λ_{i} = 1/3), so that all matching aspects (namely, temporal, spatial, and membership) are taken into account in a balanced manner (i.e., as an average).
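The convex combination itself can be sketched in a few lines. The snippet below assumes that the three component similarities have already been computed and lie in [0, 1]; the function and parameter names are illustrative, and the weights are named after the aspect they scale rather than after an index of λ, so that the sketch does not depend on the ordering used in the equation.

```python
import math


def ec_similarity(sim_temp, sim_spatial, sim_member,
                  w_temp=1 / 3, w_spatial=1 / 3, w_member=1 / 3):
    """Convex combination of temporal, spatial and membership similarity."""
    weights = (w_temp, w_spatial, w_member)
    if any(w < 0 for w in weights) or not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    for sim in (sim_temp, sim_spatial, sim_member):
        if not 0.0 <= sim <= 1.0:
            raise ValueError("each similarity must lie in [0, 1]")
    return w_temp * sim_temp + w_spatial * sim_spatial + w_member * sim_member
```

With the default weights the score is the plain average of the three aspects; because the weights are non-negative and sum to one, the result always stays between the smallest and the largest component similarity.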
To assess the effect of these weights, Fig. 9(a–c) illustrates the distribution of the EC similarity score on the Piraeus dataset (Δt = 1 min) with respect to (a) λ_{2},λ_{3}, (b) λ_{1},λ_{3}, and (c) λ_{1},λ_{2}. Observing the aforementioned figure, we deduce that the temporal aspect (λ_{1}) is the most dominant, as all other aspects (namely, spatial and membership) have minor impact on the final score while λ_{1} = 1/3.
This behaviour is expected, as the predicted comovement pattern may demonstrate a temporal and/or membership overlap with its actual counterpart but, due to the models’ displacement error, exhibit only minor spatial overlap, an effect that becomes stronger for higher Δt, e.g., 15 min. On the other hand, fixing λ_{2} and λ_{3}, we observe that, as we increase the value of λ_{1}, the similarity score decreases at an almost steady rate, further showing the tradeoff of focusing on a single similarity aspect over another. Similar observations can be made on the GeoLife dataset (cf. Fig. 9(d–f)), albeit with steeper increase/decrease rates as we increase λ_{2} and λ_{3}, respectively.
5.2.5 Cluster evolution accuracy
After evaluating the EC prediction within the spatiotemporal dimension, we assess the prediction accuracy regarding the clusters’ evolution. As defined in Section 4.4, a cluster C will emerge, survive for some time duration Δt, and then disappear, be absorbed into another cluster \(C^{\prime }\), or split into n clusters \(C_{1}, C_{2}, \dots , C_{n}\).
Building on the MONIC/FINGERPRINT framework [22, 27], we create a graph G = (V,E), with its nodes corresponding to the aforementioned ECs (〈P_{i},T_{i},T_{j},tp〉,j > i) and its edges corresponding to the event (i.e., emerged, survived, absorbed, split, disappeared) which occurred between the two clusters they connect. For instance, consider the setting that appears in Fig. 1 for tp = 2 (MCS); the respective graph G of cluster evolutions is illustrated in Fig. 10a. This graph represents the actual situation of cluster evolution; hence it is G_{act}. On the other hand, let us assume that our framework predicted a different situation; more specifically, at T_{5} all clusters were combined into a single entity P_{7} = 〈{a,b,c,d,e,f,g,h,i},T_{5},T_{5},2〉, which is represented in G_{pred}, as illustrated in Fig. 10b.
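A minimal sketch of such an evolution graph is given below, using plain Python containers. The EvolvingCluster tuple mirrors the 〈P, T_start, T_end, tp〉 notation, while the function name, the transition list, and the three example clusters are hypothetical and do not reproduce the setting of Fig. 1; how each event is detected is outside the scope of the sketch.

```python
from collections import namedtuple

# An evolving cluster <P, T_start, T_end, tp>; members is a frozenset so it is hashable.
EvolvingCluster = namedtuple("EvolvingCluster", ["members", "t_start", "t_end", "tp"])

# Event labels as listed in the text.
EVENTS = {"emerged", "survived", "absorbed", "split", "disappeared"}


def build_evolution_graph(transitions):
    """Build (nodes, edges) from an iterable of (source EC, target EC, event)."""
    nodes, edges = set(), {}
    for src, dst, event in transitions:
        if event not in EVENTS:
            raise ValueError(f"unknown event: {event}")
        nodes.update((src, dst))
        edges[(src, dst)] = event  # directed edge labelled with the event
    return nodes, edges


# Hypothetical example: two clusters that are absorbed into a third one.
p1 = EvolvingCluster(frozenset("abc"), 1, 3, 2)
p2 = EvolvingCluster(frozenset("de"), 1, 3, 2)
p3 = EvolvingCluster(frozenset("abcde"), 4, 5, 2)
nodes, edges = build_evolution_graph([(p1, p3, "absorbed"), (p2, p3, "absorbed")])
```

Building one such graph from the actual clusters and one from the predicted clusters gives the two structures that are compared in the remainder of this section.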
In order to compare the evolution over time between G_{act} and G_{pred}, we employ the Graph Edit Distance (GED) measure. In particular, GED is a graph similarity measure analogous to the Levenshtein distance for strings, defined as the minimum cost of an edit path (a sequence of node and edge edit operations) that transforms a graph G1 into a graph isomorphic to another graph G2 [1]. Because the resulting graphs may vary significantly in size, which makes the raw scores hard to compare, we use the normalized GED (nGED, cf. (10)) in order to bound the scores between zero and one or, in other words, between full isomorphism and no isomorphism, respectively.
where G_{act} and G_{pred} are the evolution graphs of C_{act} and C_{pred}, and V and E denote their corresponding vertices and edges, respectively.
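For intuition only, the sketch below shows one way to obtain a score bounded in [0, 1] from two evolution graphs. It makes two simplifying assumptions that are not taken from (10): nodes and edges are matched by their labels, so the edit cost reduces to the size of the symmetric differences (an upper bound on the exact unit-cost GED), and the cost is normalised by the total number of vertices and edges of both graphs. The function names are hypothetical.

```python
def label_matched_edit_cost(nodes_a, edges_a, nodes_b, edges_b):
    """Unit-cost insertions/deletions needed when equal labels are matched.

    This is an upper bound on the exact graph edit distance, not the exact value.
    """
    node_edits = len(set(nodes_a) ^ set(nodes_b))
    edge_edits = len(set(edges_a) ^ set(edges_b))
    return node_edits + edge_edits


def normalised_edit_cost(nodes_a, edges_a, nodes_b, edges_b):
    """Edit cost divided by |V_a| + |E_a| + |V_b| + |E_b| (assumed normalisation)."""
    total = (len(set(nodes_a)) + len(set(edges_a))
             + len(set(nodes_b)) + len(set(edges_b)))
    if total == 0:
        return 0.0  # two empty graphs are identical
    return label_matched_edit_cost(nodes_a, edges_a, nodes_b, edges_b) / total
```

Under these assumptions a score of 0 means that the two graphs are identical and a score of 1 means that they share no vertex or edge. An exact GED computation, such as the algorithm of [1], would search over all node correspondences instead of fixing them by label.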
After comparing the evolution graph of each predicted EC with that of its corresponding actual one, Table 2 reports the minimum, median, and maximum nGED, respectively. For the Piraeus dataset, we can clearly observe that, for both FLP models, the nGED between the actual and predicted EC evolution graphs increases with Δt, albeit with some minor noise.
Moreover, for the GeoLife dataset, we observe similar behaviour but with less isomorphic graphs, a result correlated with the models’ performance over the aforementioned datasets. These conclusions are in accordance with the ones derived from Fig. 8, further strengthening the (intuitively obvious) correlation between FLP accuracy and EC prediction, even at higher levels of abstraction (i.e., cluster evolution).
5.3 Towards exploiting the predicted clusters
A very important stage in the data analytics hierarchy is data storytelling; in other words, how to deduce meaningful and easy to interpret outcomes from a (curated) dataset at hand. After executing the EC module on both datasets for the actual and predicted locations, respectively, we present some further findings, aiming to reach some insightful conclusions regarding the objects’ future activity.
Towards this direction, we construct heatmaps of the discovered ECs using the cardinality of each cluster as the measurable magnitude and setting the parameters of EC to their corresponding default values (cf. Table 1). In particular, Figures 11 and 12 illustrate heatmaps of actual (left) and predicted (right) ECs in indicative regions of the Piraeus dataset (cells of size 225 × 225 m) and the GeoLife dataset (cells of size 256 × 256 m), respectively. In order to avoid repetitions, we choose to focus our discussion on Δt = 1.
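The construction of such a heatmap can be sketched as a simple grid aggregation. The snippet assumes that each cluster is represented by a single position (e.g. its centroid) in a projected coordinate system with metre units, and that the cardinalities of the clusters falling into the same cell are summed; both choices, as well as the function name, are assumptions made for illustration.

```python
from collections import defaultdict


def cardinality_heatmap(clusters, cell_size_m):
    """Aggregate cluster cardinalities on a square grid.

    clusters: iterable of (x_m, y_m, cardinality), coordinates in metres.
    Returns a dict {(column, row): summed cardinality}.
    """
    grid = defaultdict(int)
    for x, y, cardinality in clusters:
        # Floor division maps a coordinate to the index of the cell containing it.
        cell = (int(x // cell_size_m), int(y // cell_size_m))
        grid[cell] += cardinality
    return dict(grid)
```

For instance, with a cell size of 225 m, clusters located at (10, 10) and (200, 100) fall into the same cell, whereas a cluster at (230, 10) falls into the neighbouring one.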
Starting with the Piraeus dataset, if we are interested in predicting the vessels’ port entry/exit routes, Fig. 11 (top) provides an answer: our GRU-based model predicts that most traffic will be concentrated at the ferry lines connecting the ports at peak hours, with emphasis on the ferry line connecting Salamina with Perama. Focusing on the STFLP model, we not only observe the same behaviour (albeit at a higher resolution), but can also accurately predict the intra-port activity, a conclusion that strengthens our confidence in a predictive framework able to correctly anticipate incoming vessels and thus to manage port traffic effectively. Beyond real-life applications, the aforementioned findings may prompt domain experts to investigate these occurrences further and reach some meaningful conclusions.
Moving to the urban domain and the GeoLife dataset, as Fig. 12 illustrates, most traffic lies around some places of interest (such as parks, shops, and university campuses, as can be easily deduced when comparing the spatial footprint of the discovered evolving clusters with open spatial data sources, like Wikimapia.org, LinkedGeoData.org, etc.). More specifically, focusing on the northern part of Beijing, the GRUFLP model predicts not only moderate traffic at the aforementioned areas, but also mild traffic on surrounding roads at peak hours. Similar results are also obtained at the southwestern side of Beijing using the STFLP model. In general, by combining the predicted clusters with the hours at which they are created, a domain expert (e.g. a transportation analyst) could determine where and when congestion will occur, thus using our predictive models as a utility to effectively reroute urban traffic.
6 Conclusions and future work
In this paper, we propose a unified framework for the online prediction of evolving clusters and their evolution over time. The proposed method is based on a combination of FLP and EC algorithms and is evaluated on two real-world datasets from the maritime and urban domain, respectively, taking into account a novel comovement pattern similarity measure, which is able to match the predicted clusters with the actual ones. Our experimental study on these real-life datasets demonstrates the efficiency and effectiveness of the proposed methodology. Thus, based on the potential applications, as well as the quality of the results produced, we believe that the proposed model can be a valuable utility for researchers and practitioners alike.
In the near future, we aim to further tune our models with respect to their respective architecture and/or parameters, and experiment on more datasets from other domains (e.g. sports analytics) in order to demonstrate the versatility of our framework.
Data availability
The authors declare that all data supporting the findings of this study are available within the article.
Notes
The dataset is publicly available at https://zenodo.org/record/4498410
The dataset is publicly available at https://research.microsoft.com/enus/downloads/b16d359dd164469e9fd4daa38f2b2e13/
References
AbuAisheh Z, Raveaux R, Ramel J, Martineau P (2015) An exact graph edit distance algorithm for solving pattern recognition problems. In: ICPRAM (1). SciTePress, pp 271–278
Allen JF (1983) Maintaining knowledge about temporal intervals. Commun ACM 26(11):832–843
Buchin K, Buchin M, van Kreveld MJ, Speckmann B, Staals F (2013) Trajectory grouping structure. In: WADS, Lecture Notes in Computer Science, vol 8037. Springer, pp 219–230
Chen L, Gao Y, Fang Z, Miao X, Jensen CS, Guo C (2019) Realtime distributed comovement pattern detection on streaming trajectories. Proc VLDB Endow 12(10):1208–1220
Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 1724–1734
Chondrodima E, Georgiou H, Pelekis N, Theodoridis Y (2022) Particle swarm optimization and rbf neural networks for public transport arrival time prediction using gtfs data. Int J Inf Manag Data Insights 2(2):100086
Chondrodima E, Mandalis P, Pelekis N, Theodoridis Y (2022) Machine learning models for vessel route forecasting: An experimental comparison. In: Proceedings of the 23rd IEEE international conference on mobile data management (MDM)
Dong Y, Pinelli F, Gkoufas Y, Nabi Z, Calabrese F, Chawla NV (2015) Inferring unusual crowd events from mobile phone call detail records. In: Proceedings of the joint european conference on machine learning and knowledge discovery in databases (ECMLPKDD), pp 474–492
Fan Q, Zhang D, Wu H, Tan K (2016) A general and parallel platform for mining comovement patterns over largescale trajectories. Proc VLDB Endow 10(4):313–324
Fang Z, Gao Y, Pan L, Chen L, Miao X, Jensen CS (2020) CoMing: A realtime comovement mining system for streaming trajectories. In: Proceedings of ACM SIGMOD/PODS international conference on management of data, pp 2777–2780
Georgiou HV, Pelekis N, Sideridis S, Scarlatti D, Theodoridis Y (2020) Semanticaware aircraft trajectory prediction using flight plans. Int J Data Sci Anal 9(2):215–228
Hao Z, Huang X, Wang K, Cui M, Tian Y (2020) Attentionbased GRU for driver intention recognition and vehicle trajectory prediction. In: Proceedings of the 4th CAA international conference on vehicular control and intelligence (CVCI), pp 86–91
Helmi S, Kashani FB (2020) Multiscale frequent comovement pattern mining. In: Proceedings of the 36th IEEE international conference on data engineering (ICDE), pp 829–840
Jeung H, Yiu ML, Zhou X, Jensen CS, Shen HT (2008) Discovery of convoys in trajectory databases. PVLDB 1(1):1068–1080
Kalnis P, Mamoulis N, Bakiras S (2005) On discovering moving clusters in spatiotemporal data. In: Proceedings of the 17th international symposium on spatial and temporal databases SSTD, pp 364–381
Kannangara S, Xie H, Tanin E, Harwood A, Karunasekera S (2020) Tracking group movement in location based social networks. In: Proceedings of the international conference on advances in geographic information systems (SIGSPATIAL/GIS), pp 251–262
Laube P, Imfeld S, Weibel R (2005) Discovering relative motion patterns in groups of moving point objects. Int J Geogr Inf Sci 19(6):639–668
Li Y, Bailey J, Kulik L (2015) Efficient mining of platoon patterns in trajectory databases. Data Knowl Eng 100:167–187
Li Z, Ding B, Han J, Kays R (2010) Swarm: Mining relaxed temporal moving object clusters. PVLDB 3(1):723–734
Liu H, Wu H, Sun W, Lee I (2019) Spatiotemporal GRU for trajectory classification. In: Proceedings of IEEE international conference on data mining (ICDM), pp 1228–1233
Marz N (2011) How to beat the CAP theorem. In: thoughts from the red planet. http://nathanmarz.com/blog/howtobeatthecaptheorem.html. Accessed 4 Jul 2022
Ntoutsi E, Spiliopoulou M, Theodoridis Y (2012) FINGERPRINT: summarizing cluster evolution in dynamic environments. Int J Data Warehous Min 8(3):27–44
Papadimitrakis M, Stogiannos M, Sarimveis H, Alexandridis A (2021) Multiship control and collision avoidance using mpc and rbfbased trajectory predictions. Sensors 21(21):6959
Petrou P, Nikitopoulos P, Tampakis P, Glenis A, Koutroumanis N, Santipantakis GM, Patroumpas K, Vlachou A, Georgiou HV, Chondrodima E, Doulkeridis C, Pelekis N, Andrienko GL, Patterson F, Fuchs G, Theodoridis Y, Vouros GA (2019) ARGO: A big data framework for online trajectory prediction. In: Proceedings of the 16th international symposium on spatial and temporal databases (SSTD), pp 194–197
Petrou P, Tampakis P, Georgiou HV, Pelekis N, Theodoridis Y (2019) Online longterm trajectory prediction based on mined route patterns. In: Proceedings of the MultipleAspect Analysis of Semantic Trajectories - First International Workshop, MASTER 2019, Held in Conjunction with ECMLPKDD, pp 34–49
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by backpropagating errors. Nature 323:533–536
Spiliopoulou M, Ntoutsi I, Theodoridis Y, Schult R (2006) MONIC: modeling and monitoring cluster transitions. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), pp 706–711
Stogiannos M, Papadimitrakis M, Sarimveis H, Alexandridis A (2021) Vessel trajectory prediction using radial basis function neural networks. In: IEEE EUROCON 2021 - 19th international conference on smart technologies, pp 113–118
Suo Y, Chen W, Claramunt C, Yang S (2020) A ship trajectory prediction framework based on a recurrent neural network. Sensors 20(18)
Tampakis P, Chondrodima E, Tritsarolis A, Pikrakis A, Theodoridis Y, Pristouris K, Nakos H, Kalampokis P, Dalamagas T (2022) i4sea: a big data platform for sea area monitoring and analysis of fishing vessels activity. Geospatial Inf Sci 25(2):132–154
Tampakis P, Pelekis N, Doulkeridis C, Theodoridis Y (2019) Scalable distributed subtrajectory clustering. In: Proceedings of 2019 IEEE international conference on big data (Big Data), pp 950–959
Tang LA, Zheng Y, Yuan J, Han J, Leung A, Hung C, Peng W (2012) On discovery of traveling companions from streaming trajectories. In: Proceedings of the 28th IEEE international conference on data engineering (ICDE), pp 186–197
Trasarti R, Guidotti R, Monreale A, Giannotti F (2017) MyWay: Location prediction via mobility profiling. Inf Syst 64:350–367
Tritsarolis A, Kontoulis Y, Theodoridis Y (2021) The piraeus ais dataset for largescale maritime data analytics [data set]. https://doi.org/10.5281/zenodo.5562629
Tritsarolis A, Theodoropoulos GS, Theodoridis Y (2021) Online discovery of comovement patterns in mobility data. Int J Geogr Inf Sci 35(4):819–845
Wang C, Ren H, Li H (2020) Vessel trajectory prediction based on ais data and bidirectional gru. In: Proceedings of international conference on computer vision, image and deep learning (CVIDL), pp 260–264
Werbos PJ (1990) Backpropagation through time: what it does and how to do it. Proc IEEE 78(10):1550–1560
Zhang Y, Zheng L (2021) Pedestrian trajectory prediction with MLPSocialGRU. In: Proceedings of the 13th international conference on machine learning and computing (ICMLC), pp 368–372
Zheng K, Zheng Y, Yuan NJ, Shang S (2013) On discovery of gathering patterns from trajectories. In: Proceedings of the 29th IEEE international conference on data engineering (ICDE), pp 242–253
Zheng Y, Li Q, Chen Y, Xie X, Ma W (2008) Understanding mobility based on GPS data. In: Proceedings of the 10th international conference on ubiquitous computing (UbiComp), pp 312–321
Zheng Y, Xie X, Ma W (2010) Geolife: A collaborative social networking service among user, location and trajectory. IEEE Data Eng Bull 33(2):32–39
Zheng Y, Zhang L, Xie X, Ma W (2009) Mining interesting locations and travel sequences from GPS trajectories. In: Proceedings of the 18th International Conference on World Wide Web (WWW), pp 791–800
Funding
Open access funding provided by HEAL-Link Greece. This work was partially supported by project VesselAI (grant agreement No 957237; https://vesselai.eu), which has received funding by the EU Horizon 2020 R&I Programme.
Ethics declarations
Competing interests
The authors have no competing interests to declare that are relevant to the content of this article.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Tritsarolis, A., Chondrodima, E., Tampakis, P. et al. Predicting Comovement patterns in mobility data. Geoinformatica (2022). https://doi.org/10.1007/s1070702200478x
Keywords
 Machine learning
 Predictive analytics
 Comovement patterns
 Trajectory prediction
 Cluster evolution