# Encyclopedia of GIS

2017 Edition | Editors: Shashi Shekhar, Hui Xiong, Xun Zhou

# Trajectories, Discovering Similar

• George Kollios
• Michail Vlachos
• Dimitrios Gunopulos
Reference work entry
DOI: https://doi.org/10.1007/978-3-319-17885-1_1401

## Definition

The trajectory of a moving object is typically modeled as a sequence of consecutive locations in a multi-dimensional (generally two- or three-dimensional) Euclidean space. Such data types arise in many applications where the location of a given object is measured repeatedly over time. Typical trajectory data are obtained during a tracking procedure with the aid of various sensors. This is also the main difficulty with such data: they may contain a significant number of outliers, that is, incorrect measurements (unlike, for example, stock data, which contain no errors whatsoever). An example of two trajectories is shown in Fig. 1.

Many data mining tasks, such as clustering and classification, necessitate a distance function that is used to estimate the similarity or dissimilarity between any two objects in the database. Furthermore, in order to provide efficient solutions to many data mining tasks, a method is required that can quickly retrieve the data objects that are most similar to a given query object (or a set of objects). Therefore, to perform data mining on trajectories of moving objects, the following problem must be addressed: given a database $$\mathcal{D}$$ of trajectories and a query $$\mathcal{Q}$$ (not already in the database), the system has to find the trajectory $$\mathcal{T}$$ that is closest to $$\mathcal{Q}$$. In order to solve this problem, two important sub-problems must be addressed: (i) define a realistic and appropriate distance function and (ii) design an indexing scheme for answering the nearest neighbor query efficiently.

## Historical Background

Trajectories are modeled as multi-dimensional time series. Most of the related work on time-series data analysis has concentrated on the use of some metric $$L_{p}$$ norm. The $$L_{p}$$ norm distance between two n-dimensional vectors $$\bar{x}$$ and $$\bar{y}$$ is defined as $$L_{p}(\bar{x},\bar{y}) = (\sum _{i=1}^{n}(\vert x_{i} - y_{i}\vert )^{p})^{1/p}$$. For p = 2 it is the well-known Euclidean distance and for p = 1 the Manhattan distance. The advantage of this simple metric is that it allows efficient indexing with a dimensionality reduction technique (Agrawal et al. 1993; Faloutsos et al. 1994; Keogh et al. 2001). On the other hand, the model cannot deal well with outliers and is very sensitive to small distortions in the time axis (Das et al. 1997). There are a number of interesting extensions to the above model to support various transformations such as scaling (Rafiei and Mendelzon 2000), shifting (Goldin and Kanellakis 1995), normalization (Goldin and Kanellakis 1995) and moving average (Rafiei and Mendelzon 2000).
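As a minimal illustration of the norm defined above, the following Python sketch computes the distance for a chosen p and shows how a single outlying measurement dominates the result. The function name `lp_distance` and the sample vectors are assumptions made for this example only.

```python
from typing import Sequence


def lp_distance(x: Sequence[float], y: Sequence[float], p: float = 2.0) -> float:
    """L_p distance between two vectors of the same length."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1 for the norm to be a metric")
    # Sum the p-th powers of the coordinate differences, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Hypothetical data: two identical series except for one outlying sample.
clean = [0.0, 0.0, 0.0, 0.0]
noisy = [0.0, 0.0, 0.0, 10.0]

euclidean = lp_distance(clean, noisy, p=2)   # the single outlier accounts for the whole distance
manhattan = lp_distance(clean, noisy, p=1)
```

Because every coordinate contributes to the sum, one bad reading cannot be ignored, which is the sensitivity to outliers noted above.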

Other techniques to define time series similarity are based on extracting certain features (landmarks (Perng et al. 2000) or signatures (Faloutsos et al. 1997)) from each time-series and then using these features to define the similarity. Another approach is to represent a time series using the direction of the sequence at regular time intervals (Qu et al. 1998).

Although the vast majority of database/data mining research on time series data mining has focused on Euclidean distance, virtually all real world systems that use time series matching as a subroutine utilize a similarity measure that allows warping. In retrospect, this is not very surprising, since most real world processes, particularly biological processes, can evolve at varying rates. For example, in bioinformatics, it is well understood that functionally related genes will express themselves in similar ways, but possibly at different rates. Therefore, the Dynamic Time Warping (DTW) distance has been used for many datasets of this type. The method to compute DTW between two sequences is based on Dynamic Programming (Berndt and Clifford 1994) and is more expensive than computing $$L_{p}$$ norms. Approaches to mitigate the large computational cost of the DTW have appeared in Keogh (2002) and Zhu and Shasha (2003) where lower bounding functions are used in order to speed up the execution of DTW. Furthermore, an approach to combine the benefits of warping distances and $$L_{p}$$ norms has been proposed in Chen and Ng (2004).
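The dynamic programming computation mentioned above can be sketched as follows. This is the textbook quadratic-time formulation for one-dimensional sequences with absolute difference as the local cost, offered as an illustration rather than as the exact procedure of any cited paper; the name `dtw_distance` and the sample sequences are assumptions.

```python
import math
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest warping of the first i samples of a onto the first j of b.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the best of: repeat b[j-1], repeat a[i-1], or advance both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])      # same shape, different speed
with_outlier = dtw_distance([1, 2, 3], [1, 2, 50, 3])  # one outlying sample
```

Warping absorbs the repeated sample in the first pair, so the distance is zero. In the second pair the outlier must still be matched to some point, so it contributes its full deviation to the distance.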

The flexibility provided by DTW is very important; however, its efficiency deteriorates for noisy data, since by matching all the points it also matches the outliers, distorting the true distance between the sequences. An alternative approach is the use of Longest Common Subsequence (LCSS), which is a variation of the edit distance (Levenshtein 1966). The basic idea is to match two sequences by allowing them to stretch, without rearranging the order of the elements but allowing some elements to be unmatched. Using the LCSS of two sequences, one can define the distance using the length of this subsequence (Bollobás et al. 1997; Das et al. 1997).

## Scientific Fundamentals

First, some definitions are provided and then the similarity functions based on the appropriate models are presented. It is assumed that objects are points that move on the (x, y)-plane and time is discrete.

Let A and B be two trajectories of moving objects with size n and m respectively, where $$A = ((a_{x,1},a_{y,1}),\ldots,(a_{x,n},a_{y,n}))$$ and $$B = ((b_{x,1},b_{y,1}),\ldots,(b_{x,m},b_{y,m}))$$. For a trajectory A, let Head(A) be the sequence $$Head(A) = ((a_{x,1},a_{y,1}),\ldots,(a_{x,n-1},a_{y,n-1}))$$.

Given an integer δ and a real number 0 < ε < 1, $$LCSS_{\delta,\epsilon }(A,B)$$ is defined as follows:
$$\displaystyle{LCSS_{\delta,\epsilon }(A,B) = \left \{\begin{array}{ll} 0, &\text{if}\ A\ \text{or}\ B\ \text{is empty} \\ 1 + LCSS_{\delta,\epsilon }(Head(A),Head(B)), &\text{if}\ \vert a_{x,n} - b_{x,m}\vert <\epsilon \ \text{and}\ \vert a_{y,n} - b_{y,m}\vert <\epsilon \ \text{and}\ \vert n - m\vert <\delta \\ \max (LCSS_{\delta,\epsilon }(Head(A),B),\;LCSS_{\delta,\epsilon }(A,Head(B))), &\text{otherwise}\:. \end{array} \right.}$$
The constant δ controls how far in time one can go in order to match a given point from one trajectory to a point in another trajectory. The constant ε is the matching threshold (see Fig. 2).
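The recursive definition can be transcribed almost literally into code. The sketch below is a memoized recursion over prefix lengths, where `i` and `j` play the roles of n and m in the definition; the function name `lcss_recursive` and the sample trajectories are assumptions for illustration. Python's recursion limit makes this form suitable only for short trajectories.

```python
from functools import lru_cache
from typing import List, Tuple

Point = Tuple[float, float]


def lcss_recursive(A: List[Point], B: List[Point], delta: int, eps: float) -> int:
    """LCSS of two 2-D trajectories, following the three cases of the definition."""

    @lru_cache(maxsize=None)
    def rec(i: int, j: int) -> int:
        # rec(i, j) is the LCSS of the first i points of A and the first j points of B.
        if i == 0 or j == 0:
            return 0
        (ax, ay), (bx, by) = A[i - 1], B[j - 1]
        if abs(ax - bx) < eps and abs(ay - by) < eps and abs(i - j) < delta:
            return 1 + rec(i - 1, j - 1)
        return max(rec(i - 1, j), rec(i, j - 1))

    return rec(len(A), len(B))


# Hypothetical data: B follows A but contains one outlying measurement.
A = [(0, 0), (1, 1), (2, 2), (3, 3)]
B = [(0, 0), (1, 1), (50, 50), (2, 2), (3, 3)]
matched = lcss_recursive(A, B, delta=2, eps=0.5)  # → 4, the outlier stays unmatched
```

With δ = 1 only points at identical positions in the two sequences may be matched, so the same pair yields a smaller value.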

The first similarity function is based on the LCSS and the idea is to allow time stretching. Then, objects that are close in space at different time instants can be matched if the time instants are also not very far apart.

Therefore, the similarity function S1 between two trajectories A and B, given δ and ɛ, is defined as follows:
$$\displaystyle{S1(\delta,\epsilon,A,B) = \frac{LCSS_{\delta,\epsilon }(A,B)} {\min (n,m)} \:.}$$
Dividing by min(n, m), the length of the shorter sequence, normalizes the LCSS value so that it can be compared between sequences of different lengths.

The S1 function is used to define another similarity measure that is more suitable for trajectories. Consider the set of all translations. A translation simply shifts a trajectory in space by a different constant in each dimension. Let $$\mathcal{F}$$ be the family of translations. Then a function $$f_{c,d}$$ belongs to $$\mathcal{F}$$ if $$f_{c,d}(A) = ((a_{x,1} + c,a_{y,1} + d),\ldots,(a_{x,n} + c,a_{y,n} + d))$$. Using this family of translations, the following similarity function is defined.

Given δ, ɛ and the family $$\mathcal{F}$$ of translations, the similarity function S2 between two trajectories A and B, is defined as follows:
$$\displaystyle{S2(\delta,\epsilon,A,B) =\max _{f_{c,d}\in \mathcal{F}}S1(\delta,\epsilon,A,f_{c,d}(B))\:.}$$
The similarity functions S1 and S2 range from 0 to 1. Therefore, the distance function between two trajectories can be estimated as follows:
Given δ, ɛ and two trajectories A and B, then:
$$\displaystyle{ \begin{array}{rl} &D1(\delta,\epsilon,A,B) = 1 - S1(\delta,\epsilon,A,B)\quad \mathrm{and}\quad \\ &D2(\delta,\epsilon,A,B) = 1 - S2(\delta,\epsilon,A,B)\:. \end{array} }$$
Note that D1 and D2 are symmetric: $$LCSS_{\delta,\epsilon }(A,B)$$ is equal to $$LCSS_{\delta,\epsilon }(B,A)$$, and the transformation that is used in D2 is a translation, which preserves the symmetric property.

By allowing translations, similarities between movements that are parallel in space can be detected. In addition, the LCSS model allows stretching and displacement in time, so it can detect similarities in movements that happen with different speeds, or at different times.

Given the definitions above, efficient methods to compute the distance functions are presented next.

### Computing the Similarity Function S1

To compute the similarity functions S1 and S2, an LCSS computation is needed. The LCSS can be computed by a dynamic programming algorithm in $$O(n^{2})$$ time. However, if matchings are allowed only when the difference in the indices is at most δ, a faster algorithm is possible. The following result has been shown in Berndt and Clifford (1994) and Das et al. (1997): Given two trajectories A and B, with | A | = n and | B | = m, $$LCSS_{\delta,\epsilon }(A,B)$$ can be found in O(δ(n + m)) time.
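One way to realize the faster computation is to fill in only the table cells whose indices differ by less than δ. The sketch below is an illustrative reconstruction, not the cited authors' code: it relies on the observation that, under the recurrence given earlier, the table values stay constant once a row or column leaves the band, so out-of-band lookups can be clamped to the band edge. The names `lcss_banded`, `s1` and `d1` are assumptions.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def lcss_banded(A: List[Point], B: List[Point], delta: int, eps: float) -> int:
    """LCSS restricted to the band |i - j| < delta, touching O(delta * (n + m)) cells."""
    if delta < 1:
        raise ValueError("delta must be a positive integer")
    n, m = len(A), len(B)
    table: Dict[Tuple[int, int], int] = {}

    def get(i: int, j: int) -> int:
        if i == 0 or j == 0:
            return 0
        # Outside the band no match is possible, so the value equals the one
        # stored at the nearest band edge of the same row or column.
        if j - i >= delta:
            j = i + delta - 1
        elif i - j >= delta:
            i = j + delta - 1
        return table[(i, j)]

    for i in range(1, n + 1):
        for j in range(max(1, i - delta + 1), min(m, i + delta - 1) + 1):
            (ax, ay), (bx, by) = A[i - 1], B[j - 1]
            if abs(ax - bx) < eps and abs(ay - by) < eps:
                table[(i, j)] = 1 + get(i - 1, j - 1)
            else:
                table[(i, j)] = max(get(i - 1, j), get(i, j - 1))
    return get(n, m)


def s1(A: List[Point], B: List[Point], delta: int, eps: float) -> float:
    """Similarity S1: LCSS normalized by the length of the shorter trajectory."""
    if not A or not B:
        return 0.0
    return lcss_banded(A, B, delta, eps) / min(len(A), len(B))


def d1(A: List[Point], B: List[Point], delta: int, eps: float) -> float:
    """Distance D1 = 1 - S1."""
    return 1.0 - s1(A, B, delta, eps)


# Hypothetical data: B follows A but contains one outlying measurement.
A = [(0, 0), (1, 1), (2, 2), (3, 3)]
B = [(0, 0), (1, 1), (50, 50), (2, 2), (3, 3)]
similarity = s1(A, B, delta=2, eps=0.5)
```

A dictionary keeps the sketch short; a production version would more likely store each row of the band in a fixed-width array.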

If δ is small, the dynamic programming algorithm is very efficient. However, for some applications, δ may need to be large. For that case, the above computation can be improved using random sampling.

By taking a sufficiently large number of random samples from the original data, it can be shown that with high probability the random sample preserves the properties (shape, structure, average value, etc.) of the original population. The random sampling method gives an approximate result, but with a probabilistic guarantee on the error. In particular, it can be shown that, given two trajectories A and B of length n, two constants δ and ε, and a random sample $$R_{A}$$ of A with $$\vert R_{A}\vert = s$$, an approximation of $$LCSS_{\delta,\epsilon }(A,B)$$ can be computed in O(ns) time such that the approximation error is less than β with probability at least 1 − ρ, where s = f(ρ, β). To give a practical perspective on the random sampling approach: to be within 0.1 of the true similarity of two trajectories A and B with a probability of 90%, when the similarity between them is around 0.8, A should be sampled at 250 locations. Notice that this number is independent of the length of both A and B. To capture accurately the similarity between less similar trajectories (e.g., with 0.4 similarity), more sample points must be used (e.g., 500 points).

### Computing the Similarity Function S2

Consider now the more complex similarity function S2. Here, given two sequences A, B, and constants δ, ε, the translation $$f_{c,d}$$ that maximizes the length of the longest common subsequence of A and $$f_{c,d}(B)$$, that is, $$LCSS_{\delta,\epsilon }(A,f_{c,d}(B))$$, over all possible translations must be found.

Let the length of trajectories A and B be n and m respectively. Assume also that $$f_{c_{1},d_{1}}$$ is the translation that, when applied to B, gives a longest common subsequence: $$LCSS_{\delta,\epsilon }(A,f_{c_{1},d_{1}}(B)) = a$$, and that it is also the translation that maximizes the length of the longest common subsequence: $$LCSS_{\delta,\epsilon }(A,f_{c_{1},d_{1}}(B)) =\max _{c,d\in \mathcal{R}}LCSS_{\delta,\epsilon }(A,f_{c,d}(B))$$.

The key observation is that, although there is an infinite number of translations that can be applied on B, each translation $$f_{c,d}$$ results in a longest common subsequence between A and $$f_{c,d}(B)$$, and there is a finite set of possible longest common subsequences. Therefore, it is possible to enumerate the set of translations, such that this set provably includes a translation that maximizes the length of the longest common subsequence of A and $$f_{c,d}(B)$$. Based on this idea, it has been shown in Vlachos et al. (2002) that: Given two trajectories A and B, with | A | = n and | B | = m, the S2(δ, ε, A, B) can be computed in $$O((n + m)^{3}\delta ^{3})$$ time.

Furthermore, a more efficient algorithm has been proposed that achieves a running time of $$O((m + n)\delta ^{3}/\beta ^{2})$$, given a constant 0 < β < 1. However, this algorithm is approximate, and the approximation $$AS2_{\delta,\beta }(A,B)$$ is related to the actual similarity by the formula $$S2(\delta,\epsilon,A,B) - AS2_{\delta,\beta }(A,B) <\beta$$.
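To make the idea of searching over translations concrete, the following sketch tries a finite set of candidate translations and keeps the best S1 value. It is a simple heuristic and not the exact or approximate algorithm of Vlachos et al. (2002): the candidates are the shifts that align one point of B exactly with one point of A whose index differs by less than δ. Every candidate is a valid translation, so the result is a lower bound on S2. All function names are assumptions.

```python
from typing import List, Set, Tuple

Point = Tuple[float, float]


def lcss(A: List[Point], B: List[Point], delta: int, eps: float) -> int:
    """Bottom-up evaluation of the LCSS recurrence, keeping two rows of the table."""
    prev = [0] * (len(B) + 1)
    for i in range(1, len(A) + 1):
        cur = [0] * (len(B) + 1)
        for j in range(1, len(B) + 1):
            (ax, ay), (bx, by) = A[i - 1], B[j - 1]
            if abs(ax - bx) < eps and abs(ay - by) < eps and abs(i - j) < delta:
                cur[j] = 1 + prev[j - 1]
            else:
                cur[j] = max(prev[j], cur[j - 1])
        prev = cur
    return prev[len(B)]


def s1(A: List[Point], B: List[Point], delta: int, eps: float) -> float:
    if not A or not B:
        return 0.0
    return lcss(A, B, delta, eps) / min(len(A), len(B))


def translate(B: List[Point], c: float, d: float) -> List[Point]:
    """Apply the translation f_{c,d} to every point of B."""
    return [(x + c, y + d) for x, y in B]


def s2_lower_bound(A: List[Point], B: List[Point], delta: int, eps: float) -> float:
    """Best S1 over a heuristic, finite set of candidate translations."""
    best = s1(A, B, delta, eps)          # the identity translation
    tried: Set[Tuple[float, float]] = set()
    for i, (ax, ay) in enumerate(A):
        # Only pair points whose indices differ by less than delta.
        for j in range(max(0, i - delta + 1), min(len(B), i + delta)):
            bx, by = B[j]
            shift = (ax - bx, ay - by)
            if shift not in tried:
                tried.add(shift)
                best = max(best, s1(A, translate(B, *shift), delta, eps))
    return best


# Hypothetical data: B is A shifted by (-10, +5).
A = [(0, 0), (1, 1), (2, 0), (3, 1)]
B = [(-10, 5), (-9, 6), (-8, 5), (-7, 6)]
plain = s1(A, B, delta=2, eps=0.5)
shifted = s2_lower_bound(A, B, delta=2, eps=0.5)
```

Without translation the two trajectories share no matching points, whereas one of the candidate shifts maps B exactly onto A. The heuristic costs one quadratic LCSS evaluation per candidate, so it is meant for illustration rather than for large inputs.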

### Indexing for LCSS Based Similarity

Even though the approximation algorithm for the D2 distance significantly reduces the computational cost over the exact algorithm, it can still be costly when one is interested in similarity search on massive trajectory databases. Thus, a hierarchical clustering algorithm using the distance D2 is provided that can be used to answer similarity queries efficiently.

The major obstacle in providing an indexing scheme for the distance function D2 is that D2 is not a metric, since it does not obey the triangle inequality. This makes the use of traditional indexing techniques difficult. Indeed, it is easy to construct examples with trajectories A, B and C, where D2(δ, ε, A, C) > D2(δ, ε, A, B) + D2(δ, ε, B, C). Such an example is shown in Fig. 3, where D2(δ, ε, A, B) = D2(δ, ε, B, C) = 0 (since the similarity is 1), and D2(δ, ε, A, C) = 1 (because the similarity within ɛ in space is zero).
However, a weaker version of the triangle inequality can be proven, which can help in pruning parts of the database and improving the search performance. First, the following function is defined:
$$\displaystyle{\begin{array}{rl} &LCSS_{\delta,\epsilon,\mathcal{F}}(A,B) \\ &\quad =\max _{f_{c,d}\in \mathcal{F}}LCSS_{\delta,\epsilon }(A,f_{c,d}(B))\:.\end{array} }$$
Clearly, $$D2(\delta,\epsilon,A,B) = 1 -\frac{LCSS_{\delta,\epsilon,\mathcal{F}}(A,B)} {\min (\vert A\vert,\vert B\vert )}$$ (as before, $$\mathcal{F}$$ is the set of translations). Now, the following can be shown: Given trajectories A, B, C:
$$\displaystyle{ \begin{array}{rl} LCSS_{\delta,2\epsilon,\mathcal{F}}(A,C)& \geq LCSS_{\delta,\epsilon,\mathcal{F}}(A,B) \\ & + LCSS_{\delta,\epsilon,\mathcal{F}}(B,C) -\vert B\vert \quad \end{array} }$$
where | B | is the length of sequence B.

To create the indexing structure, the set of trajectories is partitioned into groups according to their length, so that the longest trajectory in each group is at most a times the shortest (typically a = 2 is used.) Then, a hierarchical clustering algorithm is applied on each set, and the tree that the algorithm produces is used as follows:

For every node C of the tree, the medoid ($$M_{C}$$) of the cluster represented by this node is stored. The medoid is the trajectory that has the minimum distance (or maximum LCSS) from every other trajectory in the cluster: $$\max _{v_{i}\in C}\min _{v_{j}\in C}LCSS_{\delta,\epsilon,\mathcal{F}}(v_{i},v_{j},e)$$. However, keeping only the medoid is not enough. A method is needed to efficiently prune part of the tree during the search procedure. Namely, given the tree and a query sequence Q, the algorithm should decide whether to follow the subtree that is rooted at C or not. From the previous inequality it is known that for any sequence B in C:
$$\displaystyle{ \begin{array}{rl} LCSS_{\delta,\epsilon,\mathcal{F}}(B,Q) & \leq \vert B\vert + LCSS_{\delta,2\epsilon,\mathcal{F}}(M_{C},Q) \\ &\qquad \ \ - LCSS_{\delta,\epsilon,\mathcal{F}}(M_{C},B)\end{array} }$$
or in terms of distance:
$$\displaystyle{ \begin{array}{rl} &D2(\delta,\epsilon,B,Q) = 1 -\frac{LCSS_{\delta,\epsilon,\mathcal{F}}(B,Q)} {\min (\vert B\vert,\vert Q\vert )} \\ & \qquad \geq 1 - \frac{\vert B\vert } {\min (\vert B\vert,\vert Q\vert )} - \frac{LCSS_{\delta,2\epsilon,\mathcal{F}}(M_{C},Q)} {\min (\vert B\vert,\vert Q\vert )} \\ & \qquad + \frac{LCSS_{\delta,\epsilon,\mathcal{F}}(M_{C},B)} {\min (\vert B\vert,\vert Q\vert )} \:. \end{array} }$$
In order to provide an upper bound on the similarity (or a lower bound on the distance), the expression $$\vert B\vert - LCSS_{\delta,\epsilon,\mathcal{F}}(M_{C},B)$$ must be maximized over the trajectories B in the cluster. Therefore, for every node of the tree, the trajectory $$r_{C}$$ that maximizes this expression is stored along with the medoid. Using this trajectory, a lower bound on the distance between the query and any trajectory in the subtree can be estimated.
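As a hedged aside, the pruning bound can be read off from the weak triangle inequality by substituting the medoid $$M_{C}$$ for the first trajectory and the query Q for the third. The steps below are a reconstruction from the stated inequality, which yields a non-strict bound, and not a quotation of the original proof:

$$\displaystyle{ \begin{array}{rl} LCSS_{\delta,2\epsilon,\mathcal{F}}(M_{C},Q)& \geq LCSS_{\delta,\epsilon,\mathcal{F}}(M_{C},B) + LCSS_{\delta,\epsilon,\mathcal{F}}(B,Q) -\vert B\vert \\ \Longrightarrow \quad LCSS_{\delta,\epsilon,\mathcal{F}}(B,Q)& \leq \vert B\vert + LCSS_{\delta,2\epsilon,\mathcal{F}}(M_{C},Q) - LCSS_{\delta,\epsilon,\mathcal{F}}(M_{C},B)\:. \end{array} }$$

For a purely hypothetical numeric instance, take $$\vert B\vert = \vert Q\vert = 10$$, $$LCSS_{\delta,2\epsilon,\mathcal{F}}(M_{C},Q) = 3$$ and $$LCSS_{\delta,\epsilon,\mathcal{F}}(M_{C},B) = 9$$. Then $$LCSS_{\delta,\epsilon,\mathcal{F}}(B,Q) \leq 10 + 3 - 9 = 4$$, so $$D2(\delta,\epsilon,B,Q) \geq 1 - 4/10 = 0.6$$. A cluster member that is very close to the medoid cannot be close to a query that is far from the medoid, which is what allows the subtree to be skipped.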

Next, the search function that uses the index structure discussed above is presented. It is assumed that the tree contains trajectories with minimum length minl and maximum length maxl. For simplicity, only the algorithm for the 1-Nearest Neighbor query is presented.

The search procedure takes as input a node N in the tree, the query Q and the distance to the closest trajectory found so far (Fig. 4). For each of the children C, it is checked whether it is a trajectory or a cluster. If it is a trajectory, its distance to Q is compared with that of the current nearest trajectory. If it is a cluster, first the length of the query is checked and then the appropriate value for min( | B |, | Q | ) is chosen. Thus, a lower bound L is computed on the distance of the query to any trajectory in the cluster, and the result is compared with the distance of the current nearest neighbor, mindist. This cluster is examined only if L is smaller than mindist. In the scheme above, the approximate algorithm to compute the $$LCSS_{\delta,\epsilon,\mathcal{F}}$$ is used. Consequently, the value of $$(LCSS_{\delta,\epsilon,\mathcal{F}}(M_{C},B))/(\min (\vert B\vert,\vert Q\vert ))$$ that is computed can be up to β times higher than the exact value. Therefore, since the approximate algorithm for S2 described above is used, the quantity $$(\beta \cdot \min (\vert M_{C}\vert,\vert B\vert ))/(\min (\vert B\vert,\vert Q\vert ))$$ should be subtracted from the bound for D2(δ, ε, B, Q) to get the correct results.
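The control flow of the search can be sketched independently of the particular distance. In the Python outline below, `dist` and `lower_bound` are placeholders supplied by the caller: in the setting of this entry they would be D2 and the medoid-based bound, while the toy functions used in the example (numbers on a line, clusters summarized by intervals) are assumptions chosen only to keep the sketch runnable. The class `Cluster` and the function `nn_search` are likewise hypothetical names.

```python
import math
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class Cluster:
    summary: Any                                        # data needed by the bound, e.g. medoid and r_C
    children: List[Any] = field(default_factory=list)  # sub-clusters or trajectories


def nn_search(
    node: Cluster,
    query: Any,
    dist: Callable[[Any, Any], float],
    lower_bound: Callable[[Any, Any], float],
    best: Tuple[float, Any] = (math.inf, None),
) -> Tuple[float, Any]:
    """1-nearest-neighbor search that skips subtrees whose lower bound cannot win."""
    mindist, nearest = best
    for child in node.children:
        if isinstance(child, Cluster):
            # Descend only if some member could still beat the current best.
            if lower_bound(child.summary, query) < mindist:
                mindist, nearest = nn_search(child, query, dist, lower_bound, (mindist, nearest))
        else:
            d = dist(child, query)
            if d < mindist:
                mindist, nearest = d, child
    return mindist, nearest


# Toy stand-ins: items are numbers, each cluster is summarized by an interval.
evaluated: List[float] = []


def toy_dist(item: float, q: float) -> float:
    evaluated.append(item)          # record which items were actually compared
    return abs(item - q)


def toy_bound(summary: Tuple[float, float], q: float) -> float:
    lo, hi = summary
    return max(lo - q, q - hi, 0.0)  # distance from q to the interval


left = Cluster((1.0, 3.0), [1.0, 3.0])
right = Cluster((10.0, 12.0), [10.0, 12.0])
root = Cluster((1.0, 12.0), [left, right])
result = nn_search(root, 2.5, toy_dist, toy_bound)
```

In this toy run the second cluster is never opened, because its lower bound already exceeds the best distance found in the first. The order in which children are visited affects how much is pruned but not the answer, provided the bound never overestimates the true distance.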

## Key Applications

### Sciences

Trajectory data with the characteristics discussed above (multi-dimensional and noisy) appear in many scientific datasets. In environmental, earth science and biological data analysis, scientists may be interested in identifying similar patterns (e.g., weather patterns), clustering related objects or subjects based on their trajectories, and retrieving subjects with similar movements (e.g., in animal migration studies). In medical applications similar problems may occur, for example, when multiple attribute response curves in drug therapy are analyzed.

### Transportation and Monitoring Applications

In many monitoring applications, detecting movements of objects or subjects that exhibit similarity in space and time can be useful. These movements may have been reconstructed from a set of sensors, including cameras and movement sensors, and are therefore inherently noisy. Another set of applications arises from cell phone and mobile communication applications, where mobile users are tracked over time and patterns and clusters of these users can be used for improving the quality of the network (e.g., by allocating appropriate bandwidth over time and space).

## Future Directions

So far, it has been assumed that objects are points that move in a multi-dimensional space, ignoring their shape in space. However, there are many applications where the extent of each object is also important. Therefore, a future direction is to design similarity models for moving objects with extents, when both the locations and the extents of the objects change over time.

Another direction is to design a more general indexing scheme for distance functions that are similar to LCSS and can work for multiple distance functions and datasets.

## References

1. Agrawal R, Faloutsos C, Swami A (1993) Efficient similarity search in sequence databases. In: Proceedings of FODO, pp 69–84
2. Berndt D, Clifford J (1994) Using dynamic time warping to find patterns in time series. In: AAAI workshop on knowledge discovery in databases, pp 229–248
3. Bollobás B, Das G, Gunopulos D, Mannila H (1997) Time-series similarity problems and well-separated geometric sets. In: Proceedings of SCG
4. Chen L, Ng RT (2004) On the marriage of lp-norms and edit distance. In: Proceedings of VLDB, pp 792–803
5. Das G, Gunopulos D, Mannila H (1997) Finding similar time series. In: Proceedings of PKDD, pp 88–100
6. Faloutsos C, Jagadish HV, Mendelzon A, Milo T (1997) Signature technique for similarity-based queries. In: Proceedings of SEQUENCES
7. Faloutsos C, Ranganathan M, Manolopoulos I (1994) Fast subsequence matching in time series databases. In: Proceedings of ACM SIGMOD, May 1994
8. Goldin D, Kanellakis P (1995) On similarity queries for time-series data. In: Proceedings of CP ’95, Sept 1995
9. Keogh E (2002) Exact indexing of dynamic time warping. In: Proceedings of VLDB
10. Keogh E, Chakrabarti K, Mehrotra S, Pazzani M (2001) Locally adaptive dimensionality reduction for indexing large time series databases. In: Proceedings of ACM SIGMOD, pp 151–162
11. Levenshtein V (1966) Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl 10:707–710
12. Perng S, Wang H, Zhang S, Parker DS (2000) Landmarks: a new model for similarity-based pattern querying in time series databases. In: Proceedings of IEEE ICDE, pp 33–42
13. Qu Y, Wang C, Wang XS (1998) Supporting fast search in time series for movement patterns in multiple scales. In: Proceedings of ACM CIKM, pp 251–258
14. Rafiei D, Mendelzon A (2000) Querying time series data based on similarity. IEEE Trans Knowl Data Eng 12(5):675–693
15. Vlachos M, Kollios G, Gunopulos D (2002) Discovering similar multidimensional trajectories. In: Proceedings of IEEE ICDE, pp 673–684
16. Zhu Y, Shasha D (2003) Query by humming: a time series database approach. In: Proceedings of ACM SIGMOD

© Springer International Publishing AG 2017

## Authors and Affiliations

• George Kollios: Computer Science Department, Boston University, Boston, USA
• Michail Vlachos: IBM T.J. Watson Research Center, Yorktown Heights, USA
• Dimitrios Gunopulos: Department of Computer Science and Engineering, Bourns College of Engineering, The University of California at Riverside, Riverside, USA