1 Introduction

The proliferation of GPS-enabled devices, such as smartphones, tablets, and GPS trackers, has led to the production of large amounts of mobility-related data. By nature, this kind of data is streaming, and there are several application scenarios where the processing needs to take place in an online fashion. These properties pose new challenges in terms of efficient storage, analytics, and knowledge extraction out of such data. One of these challenges is online cluster analysis, where the goal is to unveil hidden patterns of collective behaviour from streaming trajectories, such as co-movement patterns [4, 9, 10, 13, 35]. Even more challenging is predictive analytics over mobility data, where the goal is to predict the future behaviour of moving objects, with a wide range of applications, such as predicting collisions, future encounters, and traffic jams. At an individual level, a typical and well-studied example of such analytics is future location prediction [11, 24, 25, 29, 33], where the goal is to predict the future location of a moving object, given a look-ahead time. However, the prediction of future mobility behaviour at a collective level has not been addressed to the same degree as its individual counterpart.

Concerning the definition of co-movement patterns, there are several approaches in the literature, such as [4, 9, 10, 13]. However, most of the above are either offline and/or operate at predefined temporal snapshots that imply temporal alignment and uniform sampling, which are not realistic assumptions. For this reason, we adopt the approach presented in [35], which, to the best of our knowledge, is the first online method for the discovery of co-movement patterns in mobility data that does not assume temporal alignment and uniform sampling. The goal in [35] is to discover co-movement patterns, namely Evolving Clusters, in an online fashion, by employing a graph-based representation. By doing so, the problem of co-movement pattern detection is transformed into identifying Maximal Cliques (MCs) (for spherical, Flock-like clusters) or Maximal Connected Subgraphs (MCSs) (for density-connected, Convoy-like clusters).

Fig. 1 Predicting evolving clusters via trajectory prediction: blue, green, and orange graphs correspond to past, current, and future (i.e., predicted) evolving clusters, respectively

In particular, the problem that we address in this paper is the Online Prediction of Co-movement Patterns. Informally, given a look-ahead time interval Δt, the goal is to predict the clusters of moving objects that are anticipated to be shaped after Δt. Figure 1 illustrates such an example, where in blue and green we have the information at hand (past and current locations / clusters, respectively) whereas in orange we have the predictions. The problem we address is quite challenging, since, apart from the inherent difficulty of predicting the future, we also need to define how the error between the actual and the predicted clusters will be measured. This further implies that a predicted cluster should be correctly “matched” with the corresponding actual cluster, which is not a straightforward procedure at all. To the best of our knowledge, the problem we aim to address has not been addressed in the literature yet.

Several mobility-related applications could benefit from such an analytics task. In the urban traffic domain, predicting co-movement patterns could assist in detecting future traffic jams, which, in turn, could help the authorities take the appropriate measures (e.g. adjusting traffic lights) in order to minimize them. In the maritime domain, a typical problem is illegal transshipment, where groups of vessels move together “close” enough for some time duration and with low speed. It becomes obvious that predicting co-movement patterns could help in predicting illegal transshipment events.

Our main contributions are the following:

  • We provide an efficient solution to the problem of Online Prediction of Co-movement Patterns.

  • We propose a co-movement pattern similarity measure, which helps us “match” the predicted with the actual clusters.

  • We build on MONIC/FINGERPRINT [22, 27], a state-of-the-art cluster evolution discovery method, in order to track the clusters’ evolution through time.

  • We extensively evaluate the aforementioned framework using large-volume real-world data from two different mobility domains, namely, maritime and urban transportation.

The rest of the paper is organized as follows. Section 2 discusses related work. In Section 3, we formally define the problem at hand. Subsequently, in Section 4 we present our methodology and introduce a co-movement pattern similarity measure along with the cluster “matching” approach. Section 5 presents our experimental study and, finally, in Section 6 we conclude the paper and discuss future extensions.

2 Related work

The work performed in this paper is closely related to three topics, (a) co-movement pattern discovery, (b) future location prediction, and (c) co-movement pattern prediction.

Co-movement pattern discovery

One of the first approaches for identifying collective mobility behaviour is the so-called flock pattern [17], which identifies groups of at least m objects that move within a disk of radius r for at least k consecutive timepoints. Inspired by this, several related works followed, such as moving clusters [15], convoys [14], swarms [19], platoons [18], traveling companions [32], and gathering patterns [39]. Even though all of these approaches provide explicit definitions of the mined patterns, their main limitation is that they search for specific collective behaviours, defined by respective parameters. A generalized mobility pattern is presented in [9], where the general co-movement pattern (GCMP) is proposed. In [13], the authors propose a frequent co-movement pattern (f-CoMP) definition for discovering patterns at multiple spatial scales, which also exploits the overall shape of the objects’ trajectories, while relaxing the temporal and spatial constraints of the seminal works (i.e., Flocks, Convoys, etc.) in order to discover more interesting patterns. The authors in [4, 10] propose a two-phase online distributed co-movement pattern detection framework, which includes a clustering phase and a pattern enumeration phase.

Another perspective regarding co-movement pattern discovery is to reduce cluster types into graph properties and view them as such. In [35], the authors propose a novel co-movement pattern definition, called Evolving Clusters (EC), that unifies the definitions of flocks and convoys and reduces them to Maximal Cliques (MC) and Maximal Connected Subgraphs (MCS), respectively. In addition, the authors propose an online algorithm that discovers several evolving cluster types simultaneously in real time, without assuming temporal alignment, in contrast to the seminal works (i.e., flocks, convoys). In our work, we build on evolving clusters [35] for co-movement pattern discovery. This choice is the most appropriate because it lets us predict the course of several pattern types at the same time, without the need to call several separate algorithms, which would add redundant computational cost.

Future location prediction

The fact that the Future Location Prediction (FLP) problem has been extensively studied underlines its importance and its applicability to a wide range of applications. Towards tackling the FLP problem, one line of work includes efforts that take advantage of historical movement patterns in order to predict the future location. Such an approach is presented in [33], where the authors propose MyWay, a hybrid, pattern-based approach that utilizes individual patterns when available, and collective ones when not, in order to provide more accurate predictions and increase the predictive ability of the system. In another effort, the authors in [24, 25] utilize the work done in [31] on distributed subtrajectory clustering in order to extract individual subtrajectory patterns from big mobility data. These patterns are subsequently utilized in order to predict the future location of the moving objects in parallel.

A different way of addressing the FLP problem includes machine learning approaches, such as Recurrent Neural Network (RNN)-based models [26], Gated Recurrent Unit (GRU) models [5], which constitute the newer generation of RNNs, or static NNs [6, 23, 28]. In this line of research, Suo et al. [29] presented a GRU model to predict vessel trajectories, using the DBSCAN algorithm to derive main trajectories and a symmetric segmented-path distance approach to eliminate the influence of a large amount of redundant data and to optimize incoming trajectories. Liu et al. [20] proposed a trajectory classifier called Spatio-Temporal GRU to model the spatio-temporal correlations and irregular temporal intervals that are prevalent in spatio-temporal trajectories. In particular, a segmented convolutional weight mechanism was proposed to capture short-term local spatial correlations in trajectories, along with an additional temporal gate to control the information flow related to the temporal interval information. Wang et al. [36] proposed a vessel berthing trajectory prediction model based on bidirectional GRU (Bi-GRU) and cubic spline interpolation. Hao et al. [12] proposed a vehicle trajectory prediction encoder-decoder model based on GRU with an attention mechanism. The proposed model comprises an intention recognition module and a trajectory prediction module. The former recognizes the driver’s intention and calculates the probabilities of turning left, lane keeping, and turning right, whereas the latter predicts the vehicle trajectory using a GRU decoder with an attention mechanism, which takes the vehicle’s historical positions as input and predicts its future position. Zhang and Zheng [38] proposed a data-driven pedestrian trajectory predictor called MLP-social-GRU. The proposed mechanism first processes a pedestrian trajectory with a Multilayer Perceptron (MLP), and then adopts a GRU to obtain hidden features of the pedestrian’s motion patterns, from which relationships between pedestrians can be simulated.

Co-movement pattern prediction

The work most similar to ours is presented in [16]. More specifically, the authors in [16] divide time into time slices of fixed step size and adopt a spherical definition of groups, where each group consists of moving objects that are confined within a radius d; their goal is to predict the centroid of the groups at the next timeslice. However, the group definition adopted in [16] is rather limited, since it identifies only spherical groups, as opposed to [8], where cylindrical clusters are detected, and to [35], where both spherical and density-connected clusters can be identified. In addition, the Group Kalman Filter (GKF) they propose, while novel, tracks only the centroid of each pattern, and not its shape and membership.

3 Problem definition

Before we proceed to the formulation of the problem, let us provide some preliminary definitions.

Definition 1

(Trajectory) A trajectory \(T = \{p_{1}, {\dots }, p_{n}\}\) of a moving object is considered as a sequence of timestamped locations, where n corresponds to the latest reported position of T. Further, \(p_{i} = \{x_{i}, y_{i}, t_{i}\}\), with \(1 \leq i \leq n\).

Definition 2

(Future Location Prediction - FLP). Given a trajectory Ti and a time interval Δt, the goal of FLP is to predict \(p^{i}_{pred} = \{x^{i}_{pred}, y^{i}_{pred}\}\) at timestamp \(t^{i}_{pred} = t^{i}_{now}+{{{\varDelta }}} t\).

Definition 3

(Evolving Cluster - EC) [35]. Given a dataset D of trajectories, a minimum cardinality threshold c, a maximum distance threshold 𝜃, and a minimum time duration threshold d, an EC \(\langle C, t_{start}, t_{end}, tp \rangle\) is a subset \(C \subseteq D\) of the moving objects’ population, \(\lvert C \rvert \geq c\), which appeared at time point \(t_{start}\) and remained alive until time point \(t_{end}\) (with \(t_{end} - t_{start} \geq d\)), during the lifetime \([t_{start}, t_{end}]\) of which the participating moving objects were spatially connected with respect to distance 𝜃 and cluster type tp (1 for maximal cliques - MC - or 2 for maximal connected subgraphs - MCS).

Definition 4

(Co-movement Pattern Prediction - CPP). Given a set D of trajectories, a set G of evolving clusters up to timeslice Tnow, a time interval Δt, and a prediction step s (sub-multiple of Δt), the goal of CPP is to predict all valid evolving clusters \(G^{\prime }\) up to Tnow + Δt with step s.

Definition 5

(Cluster Evolution Prediction - CEP). Given a set ECcurr of current ECs, a set ECpred of predicted ECs up to a time interval Δt with step s (see Definition 4), a minimum cluster membership overlap threshold τmatch, a minimum cluster temporal overlap threshold τtemp, and a minimum cluster split overlap threshold τsplit, the goal of CEP is to predict the evolution (emergence, disappearance, etc.) of the discovered clusters up to Tnow + Δt with step s, in accordance with the aforementioned three thresholds.

In particular, given two clusters C and \(C^{\prime }\) from timeslices Ti and Ti+ 1, the transitions that a cluster might encompass, are as follows:

  • Survival: \(C \in T_{i}\) survives into \(C^{\prime } \in T_{i+1}\) iff \(C^{\prime }\) is the match for C and there is no other cluster in \(T_{i}\) for which \(C^{\prime }\) is the match.

  • Absorption: \(C \in T_{i}\) is absorbed into \(C^{\prime } \in T_{i+1}\) iff \(C^{\prime }\) is the match for C and there is at least one additional cluster in \(T_{i}\) for which \(C^{\prime }\) is the match.

  • Split: \(C \in T_{i}\) is split into \(\lbrace C_{1}^{\prime }, C_{2}^{\prime }, \dots , C_{n}^{\prime }\rbrace \in T_{i+1}, n > 1\) iff the overlap of C to each of the clusters \(C_{i}^{\prime }\) is greater than a threshold τsplit and the overlap of all the clusters \(C_{i}^{\prime }\) together with C is greater than a threshold τmatch.

  • Disappearance: \(C \in T_{i}\) disappears at \(T_{i+1}\) when C is not in the set of discovered clusters at \(T_{i+1}\).

  • Emergence: \(C \in T_{i+1}\) emerges at \(T_{i+1}\) when C is not in the set of discovered clusters at \(T_{i}\).
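The transition rules above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the MONIC/FINGERPRINT implementation: clusters are plain sets of object identifiers, the "match" of a cluster is assumed to be the cluster of the next timeslice with the highest overlap that also reaches τmatch, both thresholds are treated as inclusive, and the names `overlap`, `best_match`, and `transitions` are hypothetical.

```python
def overlap(c, c_next):
    """Share of c's members that also appear in c_next."""
    return len(c & c_next) / len(c)


def best_match(c, next_clusters, tau_match):
    """Index of the best-overlapping cluster reaching tau_match, or None (assumed rule)."""
    best, best_score = None, 0.0
    for j, c_next in enumerate(next_clusters):
        score = overlap(c, c_next)
        if score >= tau_match and score > best_score:
            best, best_score = j, score
    return best


def transitions(curr, nxt, tau_match=0.5, tau_split=0.25):
    """Label every cluster of T_i and every unexplained cluster of T_{i+1}."""
    matches = [best_match(c, nxt, tau_match) for c in curr]
    labels, covered = {}, set()
    for i, c in enumerate(curr):
        m = matches[i]
        if m is not None:
            # Several clusters of T_i sharing one match means absorption.
            kind = "absorption" if matches.count(m) > 1 else "survival"
            labels[("curr", i)] = (kind, m)
            covered.add(m)
            continue
        # No single match: test whether c is spread over several clusters.
        parts = [j for j, c_next in enumerate(nxt) if overlap(c, c_next) >= tau_split]
        union = set().union(*(nxt[j] for j in parts))
        if len(parts) > 1 and overlap(c, union) >= tau_match:
            labels[("curr", i)] = ("split", parts)
            covered.update(parts)
        else:
            labels[("curr", i)] = ("disappearance", None)
    for j in range(len(nxt)):
        if j not in covered:
            labels[("next", j)] = ("emergence", None)
    return labels
```

Calling `transitions(curr, nxt, tau_match, tau_split)` with two lists of sets returns one label per cluster, keyed by the timeslice it belongs to and its position in the list.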

Figure 1 provides an illustration of Definition 4. More specifically, we know the movement of nine objects from T1 until T3 and (via EvolvingClusters with c = 3 and d = 2) the four evolving clusters that are formed: P1 = 〈{a,b,c,d,e},T1,T3,2〉, P2 = 〈{a,b,c},T1,T3,1〉, P3 = 〈{b,c,d,e},T1,T3,1〉, P4 = 〈{g,h,i},T1,T3,1〉, where tp = 1 (2) corresponds to MC (MCS, respectively). Our goal is to predict the anticipated evolution of ECs until T5, i.e., that P1, P2, P3, P4 will continue to exist and that two new patterns, P5 = 〈{f,g,h,i},T4,T5,1〉 and P6 = 〈{c,d,e},T1,T5,1〉, will emerge.

4 Methodology

In this section, we present the proposed methodology for addressing the CPP and CEP problems, as defined in Definitions 4 and 5, respectively. The proposed framework follows the popular lambda architecture, which is designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods [21]. This architectural paradigm aims to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. Figure 2 illustrates the architecture of our proposed methodology, consisting of an offline and an online layer. More specifically, at the offline layer we train our FLP-offline model, given a historical trajectory dataset, whereas at the online layer we receive the streaming GPS locations, predict the objects’ next locations (FLP-online module), and discover ECs at each timeslice. Finally, we compare the current with the predicted ECs, and output our prediction regarding the evolving clusters’ evolution [22, 27] over time.

In the subsections that follow, we give more insights regarding the modules comprising our framework in Fig. 2. More specifically, in Sections 4.1 and 4.2 we describe two alternative models that can be used for FLP, while in Sections 4.3 and 4.4, we describe the modules in charge of EC discovery and cluster evolution, respectively.

Fig. 2 Workflow for EC evolution prediction via FLP

4.1 GRU-based future location prediction

The first alternative to address the FLP problem (called GRU-FLP) employs a GRU-based method, building on the work done in [7, 30]. Below, we briefly state the update rules for the employed GRU layer [5, 37].

$$\begin{array}{@{}rcl@{}} \mathbf{z}_{k} &=& \sigma({\mathbf{W}}_{\tilde{\mathbf{p}}z} \cdot \tilde{\mathbf{p}}_{k} + {\mathbf{W}}_{hz} \cdot {\mathbf{h}}_{k-1} + {\mathbf{b}}_{z}) \end{array}$$
$$\begin{array}{@{}rcl@{}} \mathbf{r}_{k} &=& \sigma({\mathbf{W}}_{\tilde{\mathbf{p}}r} \cdot \tilde{\mathbf{p}}_{k} + {\mathbf{W}}_{hr} \cdot {\mathbf{h}}_{k-1} + {\mathbf{b}}_{r}) \end{array}$$
$$\begin{array}{@{}rcl@{}} \tilde{\mathbf{h}}_{k} &= &\tanh({\mathbf{W}}_{\tilde{\mathbf{p}}h} \cdot \tilde{\mathbf{p}}_{k} + {\mathbf{W}}_{hh} \cdot (\mathbf{r}_{k} \odot {\mathbf{h}}_{k-1}) + {\mathbf{b}}_{h}) \end{array}$$
$$\begin{array}{@{}rcl@{}} \mathbf{h}_{k} &=& \mathbf{z}_{k} \odot \mathbf{h}_{k-1} +(1-\mathbf{z}_{k}) \odot \tilde{\mathbf{h}}_{k} \end{array}$$

where z and r represent the update and reset gates, respectively, \(\tilde {\mathbf {h}}\) and h represent the intermediate memory and output, respectively. Also, in these equations, the W variables are the weight matrices and the b variables are the biases. Moreover, \(\tilde {\mathbf {p}}\) represents the input, which is composed of the differences in space (Δx, Δy), the difference in time (Δt), and the time horizon of the prediction (Δtnext).
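To make the update rules concrete, the following sketch performs a single GRU step with plain Python lists. It is only an illustration of the four equations above, not the network used in the experiments: the dictionary keys (`W_pz`, `W_hz`, and so on) are hypothetical names for the weight matrices and biases, and all values in the usage note are made up.

```python
import math


def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(xs) for xs in zip(*vectors)]


def hadamard(a, b):
    """Element-wise product, written as ⊙ in the equations."""
    return [x * y for x, y in zip(a, b)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def gru_step(p, h_prev, params):
    """One GRU update: p is the input vector, h_prev the previous output."""
    z = sigmoid(add(matvec(params["W_pz"], p),
                    matvec(params["W_hz"], h_prev), params["b_z"]))
    r = sigmoid(add(matvec(params["W_pr"], p),
                    matvec(params["W_hr"], h_prev), params["b_r"]))
    pre = add(matvec(params["W_ph"], p),
              matvec(params["W_hh"], hadamard(r, h_prev)), params["b_h"])
    h_tilde = [math.tanh(x) for x in pre]
    one_minus_z = [1.0 - x for x in z]
    # h_k = z ⊙ h_{k-1} + (1 - z) ⊙ h~_k
    return add(hadamard(z, h_prev), hadamard(one_minus_z, h_tilde))
```

With all weights and biases set to zero, both gates evaluate to 0.5 and the candidate memory to 0, so the step simply halves the previous output; this gives a quick sanity check of the last equation.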

Based on [7, 30], our GRU-based NN architecture (GRU-FLP) is composed of the following layers: a) an input layer of four neurons, one for each input variable, b) a single GRU hidden layer composed of 150 neurons, c) a fully-connected hidden layer composed of 50 neurons, and d) an output layer of two neurons, one for each prediction coordinate (Δxnext, Δynext); a schematic overview of the proposed network architecture is presented in Fig. 3. Given the two most recent locations pi− 1,pi of an object’s trajectory, we predict Δxnext and Δynext, which afterwards are summed with pi, thus producing the predicted location of the aforementioned object.
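The input and output handling described above can be sketched as follows. Points are assumed to be `(x, y, t)` tuples in planar coordinates, the function names are hypothetical, and `constant_velocity_model` is only a stand-in for the trained network so that the example runs on its own; it is not the GRU-FLP model.

```python
def flp_features(p_prev, p_curr, dt_next):
    """Build (Δx, Δy, Δt, Δt_next) from the two most recent points."""
    return (p_curr[0] - p_prev[0],
            p_curr[1] - p_prev[1],
            p_curr[2] - p_prev[2],
            dt_next)


def constant_velocity_model(features):
    """Stand-in predictor (assumption): keep the last observed velocity."""
    dx, dy, dt, dt_next = features
    return (dx * dt_next / dt, dy * dt_next / dt)


def predicted_location(p_curr, delta_pred, dt_next):
    """Sum the predicted displacement with the latest point p_i."""
    dx_next, dy_next = delta_pred
    return (p_curr[0] + dx_next, p_curr[1] + dy_next, p_curr[2] + dt_next)


# Usage: two points 60 s apart, look-ahead of 120 s.
p_prev, p_curr = (0.0, 0.0, 0), (10.0, 5.0, 60)
features = flp_features(p_prev, p_curr, dt_next=120)
p_pred = predicted_location(p_curr, constant_velocity_model(features), 120)
```

Predicting displacements rather than absolute coordinates keeps the model's output independent of where in the study area the object happens to be.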

Fig. 3 GRU-based FLP architecture

4.2 Sub-trajectory based future location prediction

The second alternative to address the FLP problem (called ST-FLP) employs historical mobility patterns by utilizing the work done in [25]. In more detail, the problem is divided into an offline phase, which is responsible for identifying patterns of movement, and an online phase, which is responsible for predicting the future location of a moving object, given a look-ahead time and the set of patterns identified during the offline phase, as depicted in Fig. 4.

Fig. 4 Subtrajectory pattern-based FLP architecture

In more technical depth, the offline component consists of a distributed file system, which contains accumulated historical mobility data, and the Distributed Subtrajectory Pattern Extraction module, which takes as input a distributed trajectory dataset from the distributed file system and identifies a set of subtrajectory patterns. A predictor is built for each moving object by taking into account its individual past movement. The online component receives as input streams of mobility data concerning the recent positions of moving objects. For each moving object, the corresponding subtrajectory patterns are retrieved, its k most recent positions are matched with the most similar pattern, and the object’s future location up to the given look-ahead time Δt is predicted.

4.3 Evolving clusters discovery

After receiving the predicted locations for each moving object, we use EvolvingClusters [35] in order to shape the predicted co-movement patterns. Because the sampling rate may vary for each moving object, we use linear interpolation to temporally align the predicted locations at a common timeslice with a stable sampling (alignment) rate sr.
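A minimal sketch of this alignment step is given below. It assumes that each object's locations are `(x, y, t)` tuples ordered by time and that the common timeslices are the multiples of the alignment rate `sr`; both the function names and this choice of grid are assumptions made for illustration.

```python
def interpolate_at(p_a, p_b, t):
    """Linearly interpolate between points p_a and p_b at time t."""
    (xa, ya, ta), (xb, yb, tb) = p_a, p_b
    if ta == tb or not ta <= t <= tb:
        raise ValueError("t must lie inside a non-degenerate [ta, tb]")
    w = (t - ta) / (tb - ta)
    return (xa + w * (xb - xa), ya + w * (yb - ya), t)


def align(points, sr):
    """Resample a time-ordered list of (x, y, t) at multiples of sr."""
    aligned = []
    for p_a, p_b in zip(points, points[1:]):
        if p_a[2] == p_b[2]:
            continue  # duplicate timestamp, nothing to interpolate
        # First multiple of sr at or after p_a's timestamp (ceiling division).
        t = -(-p_a[2] // sr) * sr
        while t <= p_b[2]:
            # Skip a timeslice already emitted by the previous segment.
            if not aligned or aligned[-1][2] < t:
                aligned.append(interpolate_at(p_a, p_b, t))
            t += sr
    return aligned
```

For example, `align([(0, 0, 30), (60, 30, 90), (60, 90, 150)], 60)` yields one interpolated point at t = 60 and one at t = 120.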

Given a timeslice Tnow, EvolvingClusters works, in a nutshell, as follows:

  • Calculates the pairwise distance for each pair of objects within Tnow, and drops the pairs with distance greater than 𝜃;

  • Creates a graph based on the filtered locations, and extracts its Maximal Connected Subgraphs (MCS) and Maximal Cliques (MC) with respect to c;

  • Maintains the currently active (and inactive) clusters, given the MCS and MC of Tnow and the recent (active) pattern history; and

  • Outputs the eligible active patterns with respect to c, d, and 𝜃.
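The first two steps, which operate on a single timeslice, can be sketched as follows. This is not the EvolvingClusters implementation of [35]: it assumes planar coordinates with Euclidean distance, connects two objects when their distance is at most 𝜃 (following the maximum distance threshold of Definition 3), omits the maintenance of the pattern history, and uses the textbook Bron-Kerbosch recursion for clique enumeration. All function names are hypothetical.

```python
from itertools import combinations
from math import hypot


def proximity_graph(locations, theta):
    """locations: {object_id: (x, y)}; edge iff Euclidean distance <= theta."""
    adj = {oid: set() for oid in locations}
    for a, b in combinations(locations, 2):
        (xa, ya), (xb, yb) = locations[a], locations[b]
        if hypot(xa - xb, ya - yb) <= theta:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def connected_components(adj, c):
    """Maximal connected subgraphs (MCS) with at least c members."""
    seen, found = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:  # depth-first traversal from start
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        if len(comp) >= c:
            found.append(frozenset(comp))
    return found


def maximal_cliques(adj, c):
    """Maximal cliques (MC) with at least c members, via Bron-Kerbosch."""
    found = []

    def expand(r, p, x):
        if not p and not x:  # r cannot be extended: it is maximal
            if len(r) >= c:
                found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found
```

Every maximal clique is contained in some connected component, which is why a Flock-like cluster (MC) in the running example can sit inside a larger Convoy-like one (MCS).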

The output of EvolvingClusters, and by extension of the whole predictive model, is a tuple of four elements (oids, ts, te, tp), where oids is the set of objects that form an evolving cluster, ts and te are its start and end timestamps, respectively, and tp denotes its type.

For instance, the final output of the model for the example of Fig. 1 is a set of 4-element tuples, i.e., \(\lbrace (P_{1}, T_{1}, T_{5}, 2), (P_{2}, T_{1}, T_{5}, 1), (P_{3}, T_{1}, T_{4}, 1), (P_{3}, T_{1}, T_{5}, 2), (P_{4}, T_{1}, T_{5}, 1) \rbrace \bigcup \lbrace (P_{5}, T_{4}, T_{5}, 1), (P_{6}, T_{1}, T_{5}, 1) \rbrace\). In other words, we predict that:

  • P1,P2,P3,P4 will survive intact,

  • P3 will become inactive at timeslice T4, but it will remain active as an MCS at timeslice T5, and,

  • two new clusters P5,P6 will be discovered at timeslice T5

4.4 Cluster evolution

While the prediction of the course of a cluster is an important task, of equal importance is the tracking of the clusters’ transitions, so as to derive some more salient conclusions regarding the clusters’ movement.

In order to track the clusters’ evolution through time, we use the MONIC/FINGERPRINT framework [22, 27], which proposes a taxonomy of cluster transitions, differentiating between external and internal transitions, that concern the relationship of a cluster to other clusters, and itself, respectively. In contrast to other graph-based methods such as the Reeb graph [3] that are intended (mainly) for offline processing, MONIC can be used within online workflows, therefore rendering it a good option for predicting the clusters’ evolution through time in real time fashion.

As already discussed in Section 3, given two clusters C and \(C^{\prime }\) from timeslices \(T_{i}\) and \(T_{i+1}\), respectively, the transitions that a cluster might encompass are among the following: survival, absorption, split, disappearance, and emergence.

To measure the similarity between C and \(C^{\prime }\), the authors in [22, 27] use the following formula, which in a nutshell, is a variant of the well-known Jaccard index, a metric which we will use extensively in the following sections.

$$\text{MONIC}_{sim}(C, C^{\prime}) = \frac{\lvert C \cap C^{\prime} \rvert}{\lvert C \rvert}$$
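As a small worked example, consider the clusters P4 = {g,h,i} and P5 = {f,g,h,i} of the running example (the choice of this pair is ours, for illustration only). Their intersection is {g,h,i}, so:

$$\text{MONIC}_{sim}(P_{4}, P_{5}) = \frac{\lvert \{g,h,i\} \rvert}{\lvert \{g,h,i\} \rvert} = 1, \qquad \text{MONIC}_{sim}(P_{5}, P_{4}) = \frac{\lvert \{g,h,i\} \rvert}{\lvert \{f,g,h,i\} \rvert} = 0.75$$

The measure therefore depends on the order of its arguments, whereas the Jaccard index of the same pair equals 3/4 in both directions.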

In our running example (c.f. Fig. 1), the expected output of MONIC (for the discovered MCS) is that clusters P1, and P4 emerge at T1 and survive up to T3. From that point, after we use our FLP model, we predict that cluster P1 will survive up to T5, cluster P6 will emerge at T5, and cluster P4 will be absorbed by cluster P5 at T4, which in turn will survive up to T5.

4.5 Evaluation measures

The evaluation of a co-movement pattern prediction approach is not a straightforward task, since we need to define how the error between the predicted and the actual clusters will be quantified. Intuitively, our aim is to match each predicted cluster Cpred with the most similar actual cluster Cact. Towards this direction, we adopt, and then combine, three similarity measures, namely, spatial, temporal, and membership similarity. The spatial similarity is defined as follows:

$$Sim^{spatial}(C_{pred}, C_{act}) = \frac{MBR(C_{pred}) \bigcap MBR(C_{act})}{MBR(C_{pred}) \bigcup MBR(C_{act})}$$

where MBR(Cpred) (MBR(Cact)) is the spatial coverage of the predicted cluster (actual cluster, respectively), in other words, the Minimum Bounding Rectangle (MBR) covering the objects’ locations in the given timeslice.

Regarding the temporal dimension, we adopt Allen’s interval algebra [2] and calculate the temporal similarity as follows:

$$Sim^{temp}(C_{pred}, C_{act}) = \frac{Interval(C_{pred}) \bigcap Interval(C_{act})}{Interval(C_{pred}) \bigcup Interval(C_{act})}$$

where Interval(Cpred), (Interval(Cact)) is the time interval when the predicted cluster (actual cluster, respectively) was valid.

For the membership similarity, we adopt the Jaccard similarity:

$$Sim^{member}(C_{pred}, C_{act}) = \frac{|C_{pred} \bigcap C_{act}|}{|C_{pred} \bigcup C_{act}|}$$

Finally, we define the overall similarity between two clusters as:

$$Sim^{\ast}(C_{pred}, C_{act}) = \left\{ \begin{array}{ll} \begin{aligned} &\ \lambda_{1} \cdot Sim^{spatial}\ + \\ &\ \lambda_{2} \cdot Sim^{temp}\ + \\ &\ \lambda_{3} \cdot Sim^{member} \end{aligned} &; \begin{aligned} &\ Sim^{temp} > 0\ \land \\ &\ Sim^{member} \geq \tau_{match} \end{aligned} \\ & \\ 0 &; Else \end{array} \right.$$

where \(\lambda _{1} + \lambda _{2} + \lambda _{3} = 1,\ \lambda _{i} \in \left (0, 1\right ),\ i \in \lbrace 1, 2, 3 \rbrace\).
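The three measures and their combination can be sketched as follows. The sketch makes several assumptions that are not stated in the text: the ratio of MBRs is read as the area of their intersection over the area of their union, intervals are `(start, end)` pairs, a cluster is a dictionary with the hypothetical keys `mbr`, `interval`, and `members`, and the default weights and threshold are placeholders rather than the values used in the experiments.

```python
def mbr(points):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of (x, y) points."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def sim_spatial(r_a, r_b):
    """Intersection area over union area of two rectangles (assumed reading)."""
    w = min(r_a[2], r_b[2]) - max(r_a[0], r_b[0])
    h = min(r_a[3], r_b[3]) - max(r_a[1], r_b[1])
    inter = max(w, 0.0) * max(h, 0.0)
    area_a = (r_a[2] - r_a[0]) * (r_a[3] - r_a[1])
    area_b = (r_b[2] - r_b[0]) * (r_b[3] - r_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def sim_temp(i_a, i_b):
    """Length of the common interval over the length of the joint interval."""
    inter = min(i_a[1], i_b[1]) - max(i_a[0], i_b[0])
    if inter <= 0:
        return 0.0
    return inter / ((i_a[1] - i_a[0]) + (i_b[1] - i_b[0]) - inter)


def sim_member(m_a, m_b):
    """Jaccard similarity of two sets of object identifiers."""
    return len(m_a & m_b) / len(m_a | m_b)


def sim_star(pred, act, lambdas=(1 / 3, 1 / 3, 1 / 3), tau_match=0.5):
    """Weighted similarity; zero unless the temporal and membership tests pass."""
    s_temp = sim_temp(pred["interval"], act["interval"])
    s_member = sim_member(pred["members"], act["members"])
    if s_temp <= 0 or s_member < tau_match:
        return 0.0
    l1, l2, l3 = lambdas
    return l1 * sim_spatial(pred["mbr"], act["mbr"]) + l2 * s_temp + l3 * s_member
```

Note that a degenerate MBR (for instance, all objects on one line) has zero area, in which case this sketch returns a spatial similarity of zero.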

This further implies that a predicted cluster should be correctly matched with the corresponding actual cluster, which is not a straightforward procedure. Our methodology for matching each predicted cluster Cpred with the corresponding actual one Cact is depicted in (9).

$$Sim(EC_{pred}, EC_{act}) = \left\lbrace \underset{C_{pred} \in EC_{pred}}{\max} Sim^{\ast}(C_{pred}, C_{act});\ C_{act} \in EC_{act} \right\rbrace$$

In other words, we compare each Cpred with every Cact and “match” it with the most similar one. After all predicted clusters get traversed, we end up with the set Sim(ECpred,ECact) which holds all the “matchings”.
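The matching procedure described in this paragraph can be sketched as below. The similarity function is passed in as a parameter; in the self-contained usage that follows, a plain Jaccard index over member sets stands in for the overall similarity, which is a simplification made only for illustration. The function names are hypothetical.

```python
def jaccard(a, b):
    """Stand-in similarity (assumption): Jaccard index of two member sets."""
    return len(a & b) / len(a | b)


def match_clusters(predicted, actual, sim):
    """Pair every predicted cluster with its most similar actual cluster.

    Returns (pred_index, act_index or None, score) triples; act_index is
    None when no actual cluster has a positive similarity.
    """
    matchings = []
    for i, c_pred in enumerate(predicted):
        scores = [sim(c_pred, c_act) for c_act in actual]
        if not scores:
            matchings.append((i, None, 0.0))
            continue
        j = max(range(len(scores)), key=scores.__getitem__)
        matchings.append((i, j if scores[j] > 0 else None, scores[j]))
    return matchings


# Usage with toy clusters
predicted = [{"a", "b", "c"}, {"x", "y"}]
actual = [{"a", "b"}, {"b", "c", "d"}, {"q"}]
matchings = match_clusters(predicted, actual, jaccard)
```

Here the first predicted cluster is paired with the first actual one, while the second predicted cluster remains unmatched because it shares no member with any actual cluster.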

5 Experimental study

In this section, we evaluate our predictive model on two different mobility domains, namely maritime and urban transportation, and present our experimental results.

5.1 Datasets & preprocessing

From the maritime domain, we use the “Piraeus” dataset [34], which consists of 103 million AIS positioning messages from 5,163 vessels (passenger boats, fisheries, cargo, containers, etc.) in the wider region of Piraeus, Greece. The dataset ranges in time and space as follows:

  • temporal range: January 1st, 2018 – December 31st, 2018 (1 year);

  • spatial range: longitude in [22.992, 24.031]; latitude in [37.437, 38.046].

During the preprocessing stage, we drop erroneous records (i.e., GPS locations) based on a speed threshold speedmax, as well as stop points (by removing records corresponding to speed less than 1 knot [7]). Although it may sound optional, the latter is a critical cleansing step that prevents the ML model from considering non-evolving parts of the trajectories. Afterwards, we organize the cleansed data into trajectories based on the temporal interval between two consecutive signals of the same vessel, given a threshold dt. Finally, in order to discover evolving clusters, we need a stable and temporally aligned sampling rate. For the aforementioned dataset, we set the following thresholds: speedmax = 50 knots, dt = 30 min., and (alignment) rate = 1 min. The rationale behind these thresholds stems from the characteristics of the dataset, which were unveiled after a statistical analysis of the distribution of the speed and dt between consecutive points of the same trajectory.
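The cleansing and segmentation steps can be sketched as follows. The record layout `(x, y, t, speed_knots)`, with timestamps in seconds and a speed value attached to every record, is an assumption made for illustration, as is the function name; the default thresholds mirror the values quoted above.

```python
def clean_and_segment(records, speed_max=50.0, speed_min=1.0, max_gap=30 * 60):
    """Split one vessel's time-ordered records into cleansed trajectories.

    records: (x, y, t, speed_knots) tuples (assumed layout), t in seconds.
    """
    trajectories, current = [], []
    for rec in records:
        _, _, t, speed = rec
        if speed > speed_max or speed < speed_min:
            continue  # erroneous record or stop point
        # A silence longer than max_gap starts a new trajectory.
        if current and t - current[-1][2] > max_gap:
            trajectories.append(current)
            current = []
        current.append(rec)
    if current:
        trajectories.append(current)
    return trajectories
```

In this sketch the temporal gap is measured between consecutive retained records, so a long sequence of dropped stop points also ends the current trajectory.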

From the urban domain, we use the popular “GeoLife” dataset [40, 41, 42], which contains information regarding urban movement (mainly) in Beijing, China. In particular, it consists of approximately 23.5 million records from 181 users organized in 18,454 trips, with some of the trips being semantically annotated with respect to the means of transportation used at that time. The dataset ranges in time and space as follows:

  • temporal range: April 12th, 2007 – July 27th, 2012 (approx. 5 years);

  • spatial range: longitude in [73.50, 134.78], latitude in [17.96, 53.56] (restricted to China).

Regarding the GeoLife dataset, we omit the preprocessing step and accept it ‘as-is’. The main reason is that the raw points are already organized into trips and are neatly labeled with respect to the vehicle used. That being said, the only preprocessing we performed was to restrict its spatial range to the China region.

Fig. 5 Snapshots of Piraeus (left) and GeoLife (right) datasets

A map visualization of (parts of) the two datasets is illustrated in Fig. 5. In particular, Fig. 5 (left) visualizes Piraeus’ locations on July 10th, 2018, while Fig. 5 (right) visualizes GeoLife locations on October 20-22nd, 2011.

5.2 Experimental setup and results

The objectives of our experimental study are related to:

  • the efficiency of our framework in terms of run time;

  • FLP prediction quality;

  • its effect on EC prediction quality;

  • the sensitivity of (8) with respect to λi; and

  • quality assessment of the clusters’ predicted evolution.

The GRU-FLP and EC algorithms were implemented in Python3 (via Anaconda virtual environments), while the ST-FLP algorithm was implemented in Java. The experiments were conducted using Apache Kafka with 1 topic for the transmitted (loaded from a CSV file) and predicted locations, as well as 1 consumer for FLP and EC, respectively.

For the EC-related experiments, we utilized a single node with 8 CPU cores, 16 GB of RAM and 256 GB of HDD. For ST-FLP, we employed a 49-node Hadoop 2.7.2 cluster, where the master node consists of 8 CPU cores, 8 GB of RAM and 60 GB of HDD, while each slave node comprises 4 CPU cores, 4 GB of RAM and 60 GB of HDD. The offline component was implemented over Apache Hadoop, and the online component utilizes Apache Kafka for messaging and the Kafka Consumers interface for stream processing. All of the above are provided by okeanos-knossos, an IaaS service for the Greek Research and Academic Community. Finally, for the GRU-FLP, we used a single node equipped with 16 CPU cores, 64 GB of RAM, 1 TB of HDD, and an Nvidia GeForce RTX 2080 Ti.

In the sections that follow, we provide the experimental results of our study, using the default parameters of EvolvingClusters per dataset, as defined by [35], as Table 1 illustrates. The respective source-code is available at https://github.com/DataStories-UniPi/EvolvingClusters, as a branch of the general Evolving Clusters -related library of operations.

Table 1 Datasets’ threshold values for Evolving Cluster (EC) Discovery

5.2.1 Efficiency of our framework towards online processing

It has been shown that the EC, GRU-FLP, and ST-FLP algorithms are able to run in online mode in all the real-world settings they were evaluated on [7, 30, 35], with a response time of ≈ 250 ms for predicting the vessels’ locations at a single temporal instance. However, when combined, the running time of the proposed framework is dominated by its most ‘expensive’ steps, namely EC discovery and cluster evolution (CE).

Nevertheless, according to our experiments, our framework is able to run in online mode in all experimental real-world scenarios. In detail, focusing on the entire temporal horizon of both datasets, and running our framework with an activation step of 60 s, Fig. 6 illustrates the (average recorded) response time for its most “expensive” steps, namely, EC and CE.

As it appears in the chart, the response time for EC (c.f. Fig. 6a) remains below 400 ms for thousands of concurrent objects and increases up to (approx.) 1400 ms for 3,527 objects, which corresponds to the most crowded timeslice in our experimental study. Similar results are also recorded for CE, albeit with slightly increased response, with up to 1000 ms when considering ≈ 128 clusters, and up to 2500 ms for ≈ 256 clusters.

Summarising, although it takes some time to train the underlying FLP models, the response time of the online part of our framework is typically below 4 s per monitored timeslice, which places it well within the family of online methods. Last but not least, the performance of our framework evidently depends on the hardware of the experimental setup.

Fig. 6 Average time elapsed on (a) EC vs. #concurrent objects within a timeslice; and (b) CE vs. #ECs to consider

5.2.2 FLP prediction quality

In our method, predicting evolving clusters relies heavily on the FLP model. In this context, we compare two models, namely GRU-FLP (c.f. Section 4.1) and ST-FLP (c.f. Section 4.2), and assess their impact on EC discovery. Figure 7 illustrates the distribution of the predicted locations’ Mean Absolute Error (MAE) versus the prediction lookahead Δt. We observe that as we increase Δt, the error of GRU-FLP increases rapidly, going from hundreds of meters (Δt = 1) to tens of kilometers (Δt = 10). On the other hand, for the sub-trajectory based model (ST-FLP) the error stays within at most 1,000 meters on both the Piraeus and GeoLife datasets, with only slight increases as Δt grows, thus rendering ST-FLP the better model for the datasets and task at hand.
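As a rough illustration of how such an error curve can be tabulated, the following sketch groups prediction errors by lookahead step and averages them. The use of the haversine (great-circle) distance, the Earth radius constant, and all function names are assumptions made for this example; the text does not state how the displacement between a predicted and an actual location is computed.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an assumption of this sketch


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two (lon, lat) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def mean_displacement_per_lookahead(records):
    """Average displacement error per lookahead step.

    records: iterable of (lookahead, (lon, lat) actual, (lon, lat) predicted).
    Returns a dict {lookahead: mean error in meters}, ordered by lookahead.
    """
    errors = defaultdict(list)
    for lookahead, actual, predicted in records:
        errors[lookahead].append(haversine_m(*actual, *predicted))
    return {dt: sum(v) / len(v) for dt, v in sorted(errors.items())}
```

Feeding the function one record per (object, Δt) pair yields one mean value per lookahead, which is the kind of series plotted against Δt in an error-versus-lookahead chart.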

Fig. 7 Displacement error for GRU-FLP (top) and ST-FLP (bottom) models for Piraeus (a,c); and GeoLife (b,d) datasets, respectively

5.2.3 EC prediction quality

Having evaluated the quality of our predictive models, we proceed to assess the quality of the predicted evolving clusters for our models on both mobility domains with respect to their corresponding “ground truth”. As “ground truth” we define the evolving clusters discovered using the objects’ corresponding actual positions, and for quantifying the accuracy of a predicted cluster Cpred against an actual cluster Cact we use (9), with λi, i ∈ {1, 2, 3}, set to their default values. Without loss of generality, Fig. 8 illustrates the EC prediction quality of both predictive models over the Piraeus dataset, where solid lines correspond to the median accuracy, and the confidence intervals to the 25th and 75th percentiles, respectively.

Fig. 8 EC prediction quality – assessing models’ performance on Piraeus dataset

Observing the aforementioned figure, we clearly deduce that as we increase the lookahead (Δt), ST-FLP consistently outperforms GRU-FLP (with median accuracy close to 70%) in all aspects of cluster matching, except for simmember, where GRU-FLP outperforms ST-FLP by a significant margin. More specifically, the performance of ST-FLP on both simtemp and simspatial shows similar behaviour, decreasing at a similar rate as Δt increases, a behaviour which is seemingly correlated to the FLP models’ performance.

Because both datasets are quite challenging, containing a high degree of unpredictability (due to the number of maneuvers in each trajectory), we observe that both the GRU-FLP and ST-FLP models have quite wide confidence intervals, indicating that there is room for further tuning.

5.2.4 EC similarity score sensitivity

One of the main claimed contributions of our work is the proposed EC similarity score (c.f. (8)) for matching the predicted ECs with the actual ones. This measure takes the form of a convex combination (weights \(\lambda _{1}, \lambda _{2}, \lambda _{3}, {\sum }_{i} \lambda _{i} = 1\)) of spatial, temporal, and membership similarities, respectively. The values of λi are fixed at their corresponding default values (i.e. λi = 1/3), so that all matching aspects (namely, temporal, spatial, and membership) are taken into account in a balanced manner (i.e., averaged).
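To make the convex combination concrete, a minimal sketch follows. The three component similarities are defined earlier in the paper and are not reproduced here; this example simply assumes an interval overlap ratio for the temporal part and a Jaccard index for the membership part, and takes the spatial similarity as a precomputed number in [0, 1]. All function names are hypothetical, and the weights are passed by name so that the sketch does not depend on the index order of λ1, λ2, λ3.

```python
def interval_iou(a, b):
    """Overlap ratio of two time intervals (start, end); an assumed temporal similarity."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    union = max(a[1], b[1]) - min(a[0], b[0])
    if union <= 0:  # both intervals collapse to the same instant
        return 1.0 if a == b else 0.0
    return max(inter, 0.0) / union


def jaccard(a, b):
    """Jaccard index of two member sets; an assumed membership similarity."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def ec_similarity(sim_spatial, sim_temporal, sim_member, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Convex combination of the three similarities.

    weights = (w_spatial, w_temporal, w_member); they must be non-negative
    and sum to one, otherwise the combination is not convex.
    """
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    w_spatial, w_temporal, w_member = weights
    return w_spatial * sim_spatial + w_temporal * sim_temporal + w_member * sim_member
```

Because the weights are convex, the score always lies between the smallest and the largest of the three component similarities, so shifting weight towards a weak component can only lower the score.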

To assess the effect of these weights, Fig. 9(a-c) illustrates the distribution of EC similarity score, on the Piraeus dataset (Δt = 1 min.) with respect to (a) λ2,λ3, (b) λ1,λ3, and (c) λ1,λ2. Observing the aforementioned figure, we deduce that the temporal aspect (λ1) is the most dominant, as all other aspects (namely, spatial and membership) have minor impact on the final score, while λ1 = 1/3.

This behaviour is expected, as the predicted co-movement pattern may demonstrate a temporal and/or membership overlap with its actual counterpart but, due to the models’ displacement error, exhibit only minor spatial overlap, an effect that becomes more pronounced for higher Δt, e.g., 15 min. On the other hand, fixing λ2 and λ3, we observe that as we increase the value of λ1, the similarity score decreases at an almost steady rate, further showing the trade-off of focusing more on a single similarity aspect over another. Similar observations can be made on the GeoLife dataset (c.f. Fig. 9(d-f)), albeit with steeper increase/decrease rates, as we increase λ2 and λ3, respectively.

Fig. 9 Assessing the sensitivity of EC similarity (c.f. (8)) on Piraeus (a,b,c) and GeoLife (d,e,f) datasets, with respect to (a,d) λ2,λ3, (b,e) λ1,λ3, and (c,f) λ1,λ2

5.2.5 Cluster evolution accuracy

After evaluating the EC prediction within the spatiotemporal dimension, we assess the prediction accuracy regarding the clusters’ evolution. As defined in Section 4.4, a cluster C will emerge, survive for some time duration Δt, and then disappear, be absorbed into another cluster \(C^{\prime }\), or split into n clusters \(C_{1}, C_{2}, \dots , C_{n}\).

Exploiting the MONIC/FINGERPRINT framework [22, 27], we create a graph G = (V,E), with its nodes corresponding to the aforementioned ECs (〈Pi,Ti,Tj,tp〉,j > i) and its edges corresponding to the event (i.e., emerged, survived, absorbed, split, disappeared) which occurred between the two clusters they connect. For instance, consider the setting that appears in Fig. 1 for tp = 2 (MCS); the respective graph G of cluster evolutions is illustrated in Fig. 10a. This graph represents the actual situation of cluster evolution; hence it is Gact. On the other hand, let us assume that our framework predicted a different situation; more specifically, at T5 all clusters were combined into a single entity P7 = 〈{a,b,c,d,e,f,g,h,i},T5,T5,2〉, which is represented in Gpred, as illustrated in Fig. 10b.

Fig. 10 Discovered ECs’ evolution graph on (a) actual (Cact); and (b) predicted (Cpred) situation, respectively

Table 2 Evaluating ECs’ evolution over time (via nGED), using GRU-FLP vs. ST-FLP models on Piraeus and GeoLife datasets (lower is better)

In order to compare the evolution over time between Gact and Gpred, we exploit the Graph Edit Distance (GED) measure. In particular, GED is a graph similarity measure analogous to the Levenshtein distance for strings, defined as the minimum cost of an edit path (a sequence of node and edge edit operations) transforming a graph G1 into another graph G2, such that the transformed G1 and G2 are isomorphic [1]. Because the resulting graphs may vary significantly in size, which would make the raw scores hard to compare, we use the normalized GED (nGED – c.f. (10)) in order to bound the scores between zero and one, or in other words, between full and no isomorphism, respectively.

$$nGED = \frac{GED(G_{act}, G_{pred})}{2 * \max{(\lvert V_{act} \rvert + \lvert E_{act} \rvert,\ \lvert V_{pred} \rvert + \lvert E_{pred} \rvert)}},$$

where Gact and Gpred are the evolution graphs of Cact and Cpred, and V and E denote their corresponding vertices and edges, respectively.
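The normalization can be sketched as follows. Computing the exact GED is expensive in general, so this example makes a simplifying assumption that is not taken from the text: nodes carry stable identifiers (e.g. the EC tuples), and the edit cost is obtained by counting the nodes and labelled edges that appear in only one of the two graphs, with unit cost per operation. This identity-matched cost is an upper bound on the exact GED, which can be supplied through the optional `ged` argument instead. The graph encoding and all names are hypothetical.

```python
def identity_edit_cost(g_act, g_pred):
    """Unit-cost edits when nodes are matched by identifier.

    Each graph is a pair (nodes, edges): a set of node ids and a set of
    (source, target, event) triples. Elements found in only one graph must
    be deleted or inserted, so the cost is the size of both symmetric
    differences.
    """
    nodes_act, edges_act = g_act
    nodes_pred, edges_pred = g_pred
    return len(nodes_act ^ nodes_pred) + len(edges_act ^ edges_pred)


def normalized_ged(g_act, g_pred, ged=None):
    """nGED = GED / (2 * max(|V_act| + |E_act|, |V_pred| + |E_pred|))."""
    size_act = len(g_act[0]) + len(g_act[1])
    size_pred = len(g_pred[0]) + len(g_pred[1])
    denom = 2 * max(size_act, size_pred)
    if denom == 0:  # two empty graphs are trivially identical
        return 0.0
    if ged is None:
        ged = identity_edit_cost(g_act, g_pred)
    return ged / denom


# Toy evolution graphs (hypothetical cluster ids, not the setting of Fig. 10).
g_act = ({"P1", "P2", "P3"}, {("P1", "P2", "survived"), ("P1", "P3", "split")})
g_pred = ({"P1", "P2", "P7"}, {("P1", "P2", "survived"), ("P1", "P7", "absorbed")})
score = normalized_ged(g_act, g_pred)  # 4 edits over 2 * max(5, 5)
```

The factor of two in the denominator reflects the worst case: deleting every element of one graph and inserting every element of the other costs at most the sum of the two graph sizes, which is at most twice the larger one, so the score stays within [0, 1].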

After comparing the evolution graph of each predicted EC with its corresponding actual one, Table 2 reports the minimum, median, and maximum nGED, respectively. For the Piraeus dataset, we can clearly observe that for both FLP models, as Δt increases so does the nGED between the actual and predicted EC evolution graphs, albeit with some minor noise.

Moreover, for the GeoLife dataset, we observe similar behaviour but with less isomorphic graphs, a result correlated with the models’ performance over the aforementioned datasets. These conclusions are in accordance with the ones derived from Fig. 8, further strengthening the (intuitively obvious) correlation between FLP accuracy and EC prediction, even at higher levels of abstraction (i.e., cluster evolution).

5.3 Towards exploiting the predicted clusters

A very important stage in the data analytics hierarchy is data storytelling; in other words, how to deduce meaningful and easy to interpret outcomes from a (curated) dataset at hand. After executing the EC module on both datasets for the actual and predicted locations, respectively, we present some further findings, aiming to reach some insightful conclusions regarding the objects’ future activity.

Fig. 11 Heatmap of ECs in a region of the Piraeus dataset using GRU-FLP (top) and ST-FLP (bottom) models (left: actual; right: predicted)

Towards this direction, we construct heatmaps of the discovered ECs using the cardinality of each cluster as the measurable magnitude and setting the parameters of EC to their corresponding default values (c.f. Table 1). In particular, Figures 11 and 12 illustrate heatmaps of actual (left) and predicted ECs (right) in indicative regions of the Piraeus dataset (cells of size 225x225 m) and the GeoLife dataset (cells of size 256x256 m), respectively. In order to avoid repetition, we focus our discussion on Δt = 1.
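One plausible way to build such a heatmap is sketched below. It assumes that each EC has already been reduced to a representative point in projected coordinates (meters) and that the cardinalities of all ECs falling in the same cell are summed; neither choice is stated in the text, and the function names are hypothetical.

```python
from collections import Counter


def cell_of(x, y, cell_size):
    """Index (column, row) of the square grid cell containing point (x, y)."""
    return (int(x // cell_size), int(y // cell_size))


def ec_heatmap(clusters, cell_size):
    """Accumulate cluster cardinality per grid cell.

    clusters: iterable of (members, (x, y)), where members is the set of
    object ids in the EC and (x, y) is its representative point in meters.
    cell_size: cell edge length in meters (e.g. 225 or 256).
    Returns a Counter mapping (column, row) to total cardinality.
    """
    heat = Counter()
    for members, (x, y) in clusters:
        heat[cell_of(x, y, cell_size)] += len(members)
    return heat
```

Running the same function once on the ECs discovered from actual positions and once on those discovered from predicted positions gives two grids that can be rendered side by side for comparison.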

Starting with the Piraeus dataset, if we are interested in predicting the vessels’ port entry/exit routes, Fig. 11 (top) provides an answer: it can be observed that our GRU-based model predicted that most traffic will be concentrated at the ferry lines connecting the ports at peak hours, with emphasis on the ferry line connecting Salamina with Perama. Even further, focusing on the ST-FLP model, we not only observe the same behaviour (albeit with higher resolution), but can also accurately predict the intra-port activity, a result that strengthens our confidence in a predictive framework able to correctly anticipate incoming vessels and thus help manage port traffic effectively. Beyond real-life applications, the aforementioned findings may prompt domain experts to investigate these occurrences further and reach some meaningful conclusions.

Fig. 12 Heatmap of ECs in a region of the GeoLife dataset using GRU-FLP (top) and ST-FLP (bottom) models (left: actual; right: predicted)

Moving to the urban domain and regarding the GeoLife dataset, as Fig. 12 illustrates, most traffic lies around some places of interest (such as parks, shops, and university campuses, as can be easily deduced when comparing the spatial footprint of the discovered evolving clusters with open spatial data sources, like Wikimapia.org, LinkedGeoData.org, etc.). More specifically, focusing on the northern part of Beijing, the GRU-FLP model predicts not only moderate traffic at the aforementioned areas, but also mild traffic on surrounding roads at peak hours. Similar results are also obtained at the south-western side of Beijing using the ST-FLP model. In general, by combining the predicted clusters with the hours at which they are created, a domain expert (e.g. transportation analyst) could easily determine where and when congestion will occur, thus using our predictive models as a utility to effectively reroute urban traffic.

6 Conclusions and future work

In this paper, we propose a unified framework for the online prediction of evolving clusters and their evolution over time. The proposed method is based on a combination of FLP and EC algorithms and is evaluated through two real-world datasets from the maritime and urban domain, respectively, taking into account a novel co-movement pattern similarity measure, which is able to match the predicted clusters with the actual ones. Our experimental study on real-life datasets demonstrates the efficiency and effectiveness of the proposed methodology. Thus, based on the potential applications, as well as the quality of the results produced, we believe that the proposed model can be a valuable utility for researchers and practitioners alike.

In the near future, we aim to further tune our models with respect to their architecture and/or parameters, and to experiment on more datasets from other domains (e.g. sports analytics) in order to demonstrate the versatility of our framework.