Abstract
The similarity join is a common yet expensive operator in large-scale semantic trajectory analytics. In this paper, we propose DFST, an efficient framework for semantic trajectory similarity join in distributed systems. We devise the ITS index and the summary index, which cover the textual, temporal, and spatial domains, and theoretically demonstrate that they can effectively prune pairs of dissimilar trajectories. Moreover, DFST supports most existing similarity functions for quantifying the spatial similarity between semantic trajectories. We have conducted extensive experiments on real-world datasets, and the results show that DFST achieves a 13.6% improvement in join performance compared to existing semantic trajectory similarity join methods.
Data Availability
The data that support the findings of this study are available from the corresponding author, [Ruijie Tian], upon reasonable request.
References
Alarabi L (2017) St-hadoop: a mapreduce framework for big spatio-temporal data. In: Proceedings of the 2017 ACM International conference on management of data. SIGMOD ’17, pp 40–42. Association for computing machinery. https://doi.org/10.1145/3055167.3055181
Alarabi L (2021) Summit: a scalable system for massive trajectory data management 10(3), 2–3. https://doi.org/10.1145/3307599.3307601. Accessed 22 Nov 2021
Belesiotis A, Skoutas D, Efstathiades C, Kaffes V, Pfoser D (2018) Spatio-textual user matching and clustering based on set similarity joins. VLDB J 27(3):297–320. https://doi.org/10.1007/s00778-018-0498-5
Berndt DJ, Clifford J (1994) Using dynamic time warping to find patterns in time series. In: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining. AAAIWS’94, pp 359–370. AAAI Press
Bhatti UA, Huang M, Wu D, Zhang Y, Mehmood A, Han H (2019) Recommendation system using feature extraction and pattern recognition in clinical care systems. Enterp Inf Syst 13(3):329–351. https://doi.org/10.1080/17517575.2018.1557256
Bhatti UA, Yu Z, Chanussot J, Zeeshan Z, Yuan L, Luo W, Nawaz SA, Bhatti MA, Ain QU, Mehmood A (2022) Local similarity-based spatial–spectral fusion hyperspectral image classification with deep cnn and gabor filtering. IEEE Trans Geosci Remote Sens 60:1–15. https://doi.org/10.1109/TGRS.2021.3090410
Bouros P, Ge S, Mamoulis N (2012) Spatio-textual similarity joins. In: Proceedings of the VLDB Endowment, vol 6, pp 1–12. https://doi.org/10.14778/2428536.2428537
Chen L, Özsu MT, Oria V (2005) Robust and fast similarity search for moving object trajectories. In: Proceedings of the 2005 ACM SIGMOD international conference on management of data. SIGMOD ’05, pp 491–502. Association for Computing Machinery, New York. https://doi.org/10.1145/1066157.1066213
Chen L, Shang S, Jensen CS, Yao B, Kalnis P (2020) Parallel semantic trajectory similarity join. In: 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp 997–1008. https://doi.org/10.1109/ICDE48307.2020.00091
Ding H, Trajcevski G, Scheuermann P, Wang X, Keogh E (2008) Querying and mining of time series data: Experimental comparison of representations and distance measures. Proc. VLDB Endow. 1(2), 1542–1552. https://doi.org/10.14778/1454159.1454226
Ferrante M, Bongiorno C, Shoval N (2019) Similarity of GPS trajectories using dynamic time warping: an application to cruise tourism. In: Crocetta C (ed) Theoretical and applied statistics. Springer proceedings in mathematics & statistics. Springer, Cham, pp 91–101. https://doi.org/10.1007/978-3-030-05420-5_10
Hu H, Li G, Bao Z, Feng J, Wu Y, Gong Z, Xu Y (2016) Top-k spatio-textual similarity join. IEEE Trans Knowl Data Eng 28(2):551–565. https://doi.org/10.1109/TKDE.2015.2485213
Li R, He H, Wang R, Ruan S, He T, Bao J, Zhang J, Hong L, Zheng Y (2021) Trajmesa: a distributed nosql-based trajectory data management system. IEEE Transactions on Knowledge and Data Engineering, 1–1. https://doi.org/10.1109/TKDE.2021.3079880
Liu S, Li G, Feng J (2012) Star-join: spatio-textual similarity join. In: Proceedings of the 21st ACM International conference on information and knowledge management. CIKM ’12, pp 2194–2198. Association for computing machinery. https://doi.org/10.1145/2396761.2398600
Liu S, Li G, Feng J (2014) A prefix-filter based method for spatio-textual similarity join. IEEE Trans Knowl Data Eng 26(10):2354–2367. https://doi.org/10.1109/TKDE.2013.83
Mark DB, Otfried C, Marc VK, Mark O (2008) Computational geometry: algorithms and applications. Springer
Parent C, Spaccapietra S, Renso C, Andrienko G, Andrienko N, Bogorny V, Damiani ML, Gkoulalas-Divanis A, Macedo J, Pelekis N, Theodoridis Y, Yan Z (2021) Semantic trajectories modeling and analysis 45(4), 42–14232. https://doi.org/10.1145/2501654.2501656. Accessed 13 Dec 2021
Rao J, Lin J, Samet H (2014) Partitioning strategies for spatio-textual similarity join. In: Proceedings of the 3rd ACM SIGSPATIAL International workshop on analytics for big geospatial Data. BigSpatial ’14, pp 40–49. Association for computing machinery. https://doi.org/10.1145/2676536.2676542
Shang S, Chen L, Wei Z, Jensen CS, Zheng K, Kalnis P (2018) Parallel trajectory similarity joins in spatial networks. VLDB J 27(3):395–420. https://doi.org/10.1007/s00778-018-0502-0
Shang Z, Li G, Bao Z (2018) Dita: distributed in-memory trajectory analytics. In: Proceedings of the 2018 International conference on management of data. SIGMOD ’18, pp 725–740. Association for computing machinery, New York, NY, USA. https://doi.org/10.1145/3183713.3183743
Ta N, Li G, Xie Y, Li C, Hao S, Feng J (2017) Signature-based trajectory similarity join. IEEE Trans Knowl Data Eng 29(4):870–883. https://doi.org/10.1109/TKDE.2017.2651821
Tampakis P, Doulkeridis C, Pelekis N, Theodoridis Y (2020) Distributed subtrajectory join on massive datasets. ACM Trans Spatial Algo Syst 6(2):1–29. https://doi.org/10.1145/3373642
Toohey K, Duckham M (2015) Trajectory similarity measures. SIGSPATIAL Special 7(1):43–50. https://doi.org/10.1145/2782759.2782767
Vu T, Eldawy A (2018) R-grove: growing a family of r-trees in the big-data forest. In: Proceedings of the 26th ACM SIGSPATIAL International conference on advances in geographic information systems. SIGSPATIAL ’18, pp 532–535. Association for computing machinery. https://doi.org/10.1145/3274895.3274984
Wang X, Mueen A, Ding H, Trajcevski G, Scheuermann P, Keogh E (2013) Experimental comparison of representation methods and distance measures for time series data. Data Min Knowl Discov 26(2):275–309. https://doi.org/10.1007/s10618-012-0250-5
Wang N, Zeng J, Chen M, Zhu S (2020) An efficient algorithm for spatio-textual location matching. Distrib Parallel Databases 38(3):649–666. https://doi.org/10.1007/s10619-020-07289-9
Wang X, Zhang W, Zhang Y, Lin X, Huang Z (2017) Top-k spatial-keyword publish/subscribe over sliding window. VLDB J 26(3):301–326. https://doi.org/10.1007/s00778-016-0453-2
Yuan J, Zheng Y, Xie X, Sun G (2013) T-drive: enhancing driving directions with taxi drivers’ intelligence. IEEE Trans Knowl Data Eng 25(1):220–232. https://doi.org/10.1109/TKDE.2011.200
Zhang Y, Ma Y, Meng X (2014) Efficient spatio-textual similarity join using mapreduce. In: 2014 IEEE/WIC/ACM International joint conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), vol 1, pp 52–59. https://doi.org/10.1109/WI-IAT.2014.16
Zhang D, Tan K-L, Tung AKH (2013) Scalable top-k spatial keyword search. EDBT ’13, pp 359–370. Association for computing machinery. https://doi.org/10.1145/2452376.2452419
Zheng K, Shang S, Yuan NJ, Yang Y (2013) Towards efficient search for activity trajectories. In: 2013 IEEE 29th International Conference on Data Engineering (ICDE), pp 230–241. https://doi.org/10.1109/ICDE.2013.6544828
Zheng B, Yuan NJ, Zheng K, Xie X, Sadiq S, Zhou X (2015) Approximate keyword search in semantic trajectory database. In: 2015 IEEE 31st International conference on data engineering, pp 975–986. https://doi.org/10.1109/ICDE.2015.7113349
Zheng B, Zheng K, Sharaf MA, Zhou X, Sadiq S (2014) Efficient retrieval of top-k most similar users from travel smart card data. In: 2014 IEEE 15th International conference on mobile data management, vol 1, pp 259–268. https://doi.org/10.1109/MDM.2014.38
Zheng K, Zheng B, Xu J, Liu G, Liu A, Li Z (2017) Popularity-aware spatial keyword search on activity trajectories. World Wide Web 20(4):749–773. https://doi.org/10.1007/s11280-016-0414-0
Yang S, Cheema MA, Lin X, Zhang Y, Zhang W (2017) Reverse k nearest neighbors queries and spatial reverse top-k queries. The VLDB Journal 26:151–176. https://doi.org/10.1007/s00778-016-0445-2
Acknowledgements
This work was supported by the National Key Research and Development Program of China (2020YFF0410947) and the National Natural Science Foundation of China (62103072). Additional funding was provided by the China Postdoctoral Science Foundation (2021M690502) and Fundamental Research Funds for the Central Universities (3132022647).
Ethics declarations
Conflict of Interests
The authors declare that they have no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Other distance measures
A.1 Dynamic Time Warping (DTW)
DTW computes the minimum cumulative distance over all matchings between the points of two trajectories [4].
Definition 5
(DTW) Given two trajectories \(\mathcal {T}=\left \lbrace o^{\mathcal {T}}_1,\ldots ,o^{\mathcal {T}}_m\right \rbrace \) and \(tr=\left \lbrace o^{tr}_1,\ldots ,o^{tr}_n\right \rbrace \), DTW [4] is computed as below.
where \(\mathcal {T}^{m-1}\) is the prefix trajectory of \(\mathcal {T}\) by removing the last point.
According to (6) and Definition 5, \(DTW(\mathcal {T},tr)\) is never less than Fréchet\((\mathcal {T},tr)\), because DTW sums the point distances along the matching while Fréchet takes only their maximum. For the two trajectories \(\mathcal {T}_1\) and \(\mathcal {T}_3\) in Fig. 17, \(DTW(\mathcal {T}_1,\mathcal {T}_3)=6.41 >\) Fréchet\((\mathcal {T}_1,\mathcal {T}_3)=1.41\). To support DTW, DFST does not need to update ε with accumulated distances when querying the partitions. As with Fréchet, we can still utilize partition/node distance lower-bound pruning and summary pruning.
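As a minimal sketch of the relationship between the two measures, the following Python computes DTW and the discrete Fréchet distance with the standard dynamic-programming recurrences. The base cases, the 2D Euclidean `dist`, and the sample points are assumptions made for illustration; they are not taken from Definition 5 or from Fig. 17.

```python
import math

def dist(p, q):
    # Assumed point distance: 2D Euclidean.
    return math.hypot(p[0] - q[0], p[1] - q[1])

def dtw(a, b):
    # D[i][j] = DTW between the first i points of a and the first j points of b.
    m, n = len(a), len(b)
    inf = float("inf")
    D = [[inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best_prev = min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
            D[i][j] = dist(a[i - 1], b[j - 1]) + best_prev  # sum along the path
    return D[m][n]

def frechet(a, b):
    # Same recurrence shape, but the path cost is the maximum, not the sum.
    m, n = len(a), len(b)
    inf = float("inf")
    F = [[inf] * (n + 1) for _ in range(m + 1)]
    F[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best_prev = min(F[i - 1][j - 1], F[i - 1][j], F[i][j - 1])
            F[i][j] = max(dist(a[i - 1], b[j - 1]), best_prev)
    return F[m][n]

# Two hypothetical parallel trajectories, one unit apart.
t_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
t_b = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
print(dtw(t_a, t_b), frechet(t_a, t_b))
```

Since every path cost under a sum is at least its cost under a maximum, `dtw(a, b) >= frechet(a, b)` holds for any non-empty inputs, which is why a lower bound derived for Fréchet also remains a valid lower bound for DTW.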
A.2 Edit Distance on Real Sequences (EDR)
Definition 6 (EDR)
Given two trajectories \(\mathcal {T}\) and \(\mathcal {Q}\), and a matching threshold δ ≥ 0, \(EDR_{\delta }\) [23] is:
where \(\mathcal {T}^{2,m}\) stands for trajectory \( \mathcal {T} \) with its first point removed, and subcost(t,q) = 0 if dist(t,q) ≤ δ; 1 otherwise.
For the two trajectories \(\mathcal {T}_1\) and \(\mathcal {T}_3\) in Fig. 17 and δ = 1, we have \(EDR_{\delta }(\mathcal {T}_1,\mathcal {T}_3)=2\). To support EDR, we compute the distance from the query trajectory tr to the MBR of each partition. If it exceeds δ, then subcost(t,q) is always equal to 1 and \(EDR_{\delta }(\mathcal {T},tr)=\max (m,n)\), so we can safely prune this partition.
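The EDR recurrence and the MBR-based pruning test can be sketched in Python as follows. This is an illustrative sketch only: the prefix-based table (equivalent to the suffix form used in Definition 6), the 2D Euclidean `dist`, the `(xmin, ymin, xmax, ymax)` MBR layout, and the helper names `point_to_mbr` and `can_prune` are assumptions, not DFST's actual code.

```python
import math

def dist(p, q):
    # Assumed point distance: 2D Euclidean.
    return math.hypot(p[0] - q[0], p[1] - q[1])

def edr(a, b, delta):
    # E[i][j] = EDR between the first i points of a and the first j points of b.
    m, n = len(a), len(b)
    E = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        E[i][0] = i  # one trajectory empty: cost is the other's length
    for j in range(n + 1):
        E[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            subcost = 0 if dist(a[i - 1], b[j - 1]) <= delta else 1
            E[i][j] = min(E[i - 1][j - 1] + subcost,  # match or substitute
                          E[i - 1][j] + 1,            # skip a point of a
                          E[i][j - 1] + 1)            # skip a point of b
    return E[m][n]

def point_to_mbr(p, mbr):
    # Minimum distance from point p to an axis-aligned rectangle.
    xmin, ymin, xmax, ymax = mbr
    dx = max(xmin - p[0], 0.0, p[0] - xmax)
    dy = max(ymin - p[1], 0.0, p[1] - ymax)
    return math.hypot(dx, dy)

def can_prune(query, mbr, delta):
    # If every query point is farther than delta from the MBR, no point
    # stored in that partition can match any query point.
    return min(point_to_mbr(p, mbr) for p in query) > delta

query = [(0.0, 0.0), (1.0, 0.0)]
partition_mbr = (5.0, 5.0, 6.0, 6.0)           # hypothetical partition
inside = [(5.0, 5.0), (6.0, 6.0), (5.5, 5.5)]  # a trajectory in that partition
print(can_prune(query, partition_mbr, 1.0), edr(query, inside, 1.0))
```

When `can_prune` returns true, every subcost is 1, so the EDR of the query against any trajectory in the partition equals the longer of the two lengths, as the sample call illustrates.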
A.3 Longest Common Subsequence Distance (LCSS)
Definition 7 (LCSS)
Given two trajectories \(\mathcal {T}\) and \(\mathcal {Q}\) with lengths m and n, and a matching threshold δ, \(LCSS_{\delta }\) is defined as below [23]:
where \(\mathcal {T}^{m-1}\) is the prefix trajectory of \( \mathcal {T} \) with the last point removed.
For the two trajectories \(\mathcal {T}_1\) and \(\mathcal {T}_3\) in Fig. 17 and δ = 1, we have \(LCSS_{\delta }(\mathcal {T}_1,\mathcal {T}_3)=5\). Similar to EDR, for each partition’s MBR, we compute the distance to the query trajectory tr. According to Definition 7, if it is beyond δ, \(LCSS_{\delta }(\mathcal {T},tr)\) is always equal to 0, so we can also safely prune this partition.
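A small Python sketch of the LCSS recurrence is given below, using the common threshold-based form with prefix trajectories. The 2D Euclidean `dist` and the sample points are assumptions for illustration and are not the trajectories of Fig. 17.

```python
import math

def dist(p, q):
    # Assumed point distance: 2D Euclidean.
    return math.hypot(p[0] - q[0], p[1] - q[1])

def lcss(a, b, delta):
    # L[i][j] = LCSS between the first i points of a and the first j points of b.
    m, n = len(a), len(b)
    L = [[0] * (n + 1) for _ in range(m + 1)]  # empty prefix: similarity 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if dist(a[i - 1], b[j - 1]) <= delta:
                L[i][j] = L[i - 1][j - 1] + 1   # last points match
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[m][n]

t_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
t_b = [(0.0, 0.5), (5.0, 5.0), (2.0, 0.5)]      # middle point is an outlier
t_far = [(50.0, 50.0), (51.0, 50.0)]            # every point farther than delta
print(lcss(t_a, t_b, 1.0), lcss(t_a, t_far, 1.0))
```

The second call shows the pruning case: when no pair of points lies within δ, no match is ever counted and the similarity stays at 0.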
Appendix B: Comparison with other distance measures
Figure 18 shows DFST’s performance with different distance measures, including Fréchet, DTW, EDR and LCSS (ε = 0.0001). We observe that: (1) Fréchet was slower than DTW with the same threshold, because DTW sums the values along the whole path from (1, 1) to (m,n), whereas Fréchet takes only the maximum value; (2) LCSS was as fast as EDR, because both LCSS and EDR have O(mn) time complexity.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Tian, R., Li, J., Zhang, W. et al. A distributed framework for large-scale semantic trajectory similarity join. Multimed Tools Appl 83, 16205–16229 (2024). https://doi.org/10.1007/s11042-023-15236-w
DOI: https://doi.org/10.1007/s11042-023-15236-w