Abstract
In the \((1+{\varepsilon },r)\)-approximate near-neighbor problem for curves (ANNC) under some similarity measure \(\delta \), the goal is to construct a data structure for a given set \(\mathcal {C}\) of curves that supports approximate near-neighbor queries: Given a query curve Q, if there exists a curve \(C\in \mathcal {C}\) such that \(\delta (Q,C)\le r\), then return a curve \(C'\in \mathcal {C}\) with \(\delta (Q,C')\le (1+{\varepsilon })r\). There exists an efficient reduction from the \((1+{\varepsilon })\)-approximate nearest-neighbor problem to ANNC, where in the former problem the answer to a query is a curve \(C\in \mathcal {C}\) with \(\delta (Q,C)\le (1+{\varepsilon })\cdot \delta (Q,C^*)\), where \(C^*\) is the curve of \(\mathcal {C}\) most similar to Q. Given a set \(\mathcal {C}\) of n curves, each consisting of m points in d dimensions, we construct a data structure for ANNC that uses \(n\cdot O(\frac{1}{{\varepsilon }})^{md}\) storage space and has O(md) query time (for a query curve of length m), where the similarity measure between two curves is their discrete Fréchet or dynamic time warping distance. Our method is simple to implement, deterministic, and results in an exponential improvement in both query time and storage space compared to all previous bounds. Further, we also consider the asymmetric version of ANNC, where the length of the query curves is \(k \ll m\), and obtain essentially the same storage and query bounds as above, except that m is replaced by k. Finally, we apply our method to a version of approximate range counting for curves and achieve similar bounds.
Notes
Since our storage space is already in \(O(\frac{1}{{\varepsilon }})^{md}\), and \(m\cdot 2^{2\,m}\le 3^{2\,m}\) is in \(O(1)^{md}\), we could have used this larger upper bound. However, in Lemma 4 we show a tight upper bound on the number of relevant alignments, which may be useful for other applications.
See [5] for a closely related more recent result on simplifications with bounded length.
References
Afshani, P., Driemel, A.: On the complexity of range searching among curves. In: Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pp 898–917, (2018), https://doi.org/10.1137/1.9781611975031.58
Aronov, B., Filtser, O., Horton, M., Katz, M.J., Sheikhan, K.: Efficient nearest-neighbor query and clustering of planar curves. In: Algorithms and Data Structures—16th International Symposium, WADS 2019, Edmonton, AB, Canada, August 5–7, 2019, Proceedings, pp 28–42 (2019), https://doi.org/10.1007/978-3-030-24766-9_3
Buchin, K., Driemel, A., Gudmundsson, J., Horton, M., Kostitsyna, I., Löffler, M., Struijs, M.: Approximating (k, l)-center clustering for curves. In: Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pp 2922–2938, (2019), https://doi.org/10.1137/1.9781611975482.181
Bringmann, K., Driemel, A., Nusser, A., Psarros, I.: Tight bounds for approximate near neighbor searching for time series under the Fréchet distance. In: Symposium on Discrete Algorithms, SODA (2022)
Buchin, M., Driemel, A., van Greevenbroek, K., Psarros, I., Rohde, D.: Approximating length-restricted means under dynamic time warping. In: Approximation and Online Algorithms—20th International Workshop, WAOA, volume 13538, pp 225–253, (2022), https://doi.org/10.1007/978-3-031-18367-6_12
Bereg, S., Jiang, M., Wang, W., Yang, B., Zhu, B.: Simplifying 3D polygonal chains under the discrete Fréchet distance. In LATIN 2008: Theoretical Informatics, 8th Latin American Symposium, Búzios, Brazil, April 7-11, 2008, Proceedings, pp 630–641, (2008), https://doi.org/10.1007/978-3-540-78773-0_54
Bringmann, K.: Why walking the dog takes time: Fréchet distance has no strongly subquadratic algorithms unless SETH fails. In: 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, Philadelphia, PA, USA, October 18-21, 2014, pp 661–670, 2014, https://doi.org/10.1109/FOCS.2014.76
de Berg, M., Cook, A.F., IV., Gudmundsson, J.: Fast Fréchet queries. Comput. Geom. 46(6), 747–755 (2013). https://doi.org/10.1016/j.comgeo.2012.11.006
de Berg, M., Gudmundsson, J., Mehrabi, A. D.: A dynamic data structure for approximate proximity queries in trajectory data. In: Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS 2017, Redondo Beach, CA, USA, November 7–10, 2017, pp 48:1–48:4, (2017), https://doi.org/10.1145/3139958.3140023
Driemel, A., Har-Peled, S.: Jaywalking your dog: Computing the Fréchet distance with shortcuts. SIAM J. Comput. 42(5), 1830–1866 (2013). https://doi.org/10.1137/120865112
Driemel, A., Psarros, I.: ANN for time series under the Fréchet distance. In: A. Lubiw and M. R. Salavatipour, editors, Algorithms and Data Structures—17th International Symposium, WADS 2021, Virtual Event, August 9-11, 2021, Proceedings, volume 12808 of Lecture Notes in Computer Science, pp 315–328. Springer, (2021), https://doi.org/10.1007/978-3-030-83508-8_23
Driemel, A., Psarros, I., Schmidt, M.: Sublinear data structures for short Fréchet queries. CoRR, abs/1907.04420, 2019, arXiv:1907.04420
Driemel, A., Silvestri, F.: Locality-sensitive hashing of curves. In Proceedings of the 33rd International Symposium on Computational Geometry, volume 77, pp 37:1–37:16, Brisbane, Australia, July 2017. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, https://doi.org/10.4230/LIPIcs.SoCG.2017.37
Emiris, I.Z., Psarros, I.: Products of Euclidean metrics, applied to proximity problems among curves: unified treatment of discrete Fréchet and dynamic time warping distances. ACM Trans. Spatial Algorithms Syst. 6(4), 27:1-27:20 (2020). https://doi.org/10.1145/3397518
Filtser, A., Filtser, O., Katz, M. J.: Approximate nearest neighbor for curves—simple, efficient, and deterministic. In: A. Czumaj, A. Dawar, and E. Merelli, editors, 47th International Colloquium on Automata, Languages, and Programming, ICALP 2020, July 8-11, 2020, Saarbrücken, Germany (Virtual Conference), volume 168 of LIPIcs, pages 48:1–48:19. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020, https://doi.org/10.4230/LIPIcs.ICALP.2020.48
Har-Peled, S., Indyk, P., Motwani, R.: Approximate nearest neighbor: towards removing the curse of dimensionality. Theory Comput. 8(1), 321–350 (2012). https://doi.org/10.4086/toc.2012.v008a014
Har-Peled, S., Kumar, N.: Approximate nearest neighbor search for low dimensional queries. In: D. Randall, editor, Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2011, San Francisco, California, USA, January 23–25, 2011, pp 854–867. SIAM, 2011, https://doi.org/10.1137/1.9781611973082.67
Indyk, P.: High-dimensional computational geometry. PhD thesis, Stanford University, 2000
Indyk, P.: Approximate nearest neighbor algorithms for Fréchet distance via product metrics. In: Proceedings of the 8th Symposium on Computational Geometry, pp 102–106, Barcelona, Spain, June 2002. ACM Press, https://doi.org/10.1145/513400.513414
Kumar, P., Mitchell, J. S. B., Yildirim, E. A.: Computing core-sets and approximate smallest enclosing hyperspheres in high dimensions. In: Proceedings of the Fifth Workshop on Algorithm Engineering and Experiments, Baltimore, MD, USA, January 11, 2003, pp 45–55, (2003), https://doi.org/10.1145/996546.996548
Lemire, D.: Faster retrieval with a two-pass dynamic-time-warping lower bound. Pattern Recogn. 42(9), 2169–2180 (2009). https://doi.org/10.1016/j.patcog.2008.11.030
Megiddo, N.: Linear programming in linear time when the dimension is fixed. J. ACM 31(1), 114–127 (1984). https://doi.org/10.1145/2422.322418
Pagh, R., Rodler, F.F.: Cuckoo hashing. J. Algorithms 51(2), 122–144 (2004). https://doi.org/10.1016/j.jalgor.2003.12.002
Shakhnarovich, G., Darrell, T., Indyk, P.: Nearest-Neighbor Methods in Learning and Vision: Theory and Practice (Neural Information Processing). The MIT Press, Cambridge (2006)
Acknowledgements
We wish to thank Boris Aronov for helpful discussions on the problems studied in this paper.
Funding
Arnold Filtser was partially supported by Grant 1042/22 from the Israel Science Foundation. Omrit Filtser was supported by the Eric and Wendy Schmidt Fund for Strategic Innovation, by the Council for Higher Education of Israel, and by Ben-Gurion University of the Negev. Matthew J. Katz was partially supported by Grant 1884/16 from the Israel Science Foundation.
Appendices
Appendix A: A Deterministic Construction Using a Prefix Tree
When the dictionary \(\mathcal {D}\) is implemented as a hash table, the construction of the data structure is randomized, and thus the preprocessing time may be higher in the worst case. To avoid this, we can implement \(\mathcal {D}\) as a prefix tree.
A.1 Discrete Fréchet Distance
In this section we describe the implementation of \(\mathcal {D}\) as a prefix tree in the case of ANNC under DFD.
We can construct a prefix tree \(\mathcal {T}\) for the curves in \(\mathcal {I}\), where any path in \(\mathcal {T}\) from the root to a leaf corresponds to a curve stored in the tree. For each \(1\le i\le n\) and each curve \({\overline{Q}}\in \mathcal {I}_i\), if \({\overline{Q}}\notin \mathcal {T}\), we insert \({\overline{Q}}\) into \(\mathcal {T}\) and set \(C({\overline{Q}})\leftarrow C_i\).
Each node \(v\in \mathcal {T}\) corresponds to a grid point from \(\mathcal {G}\). Denote the set of v’s children by N(v). We store with v a multilevel search tree on N(v), with a level for each coordinate. The points in \(\mathcal {G}\) are the grid points contained in nm balls of radius \((1+{\varepsilon })r\). Thus, when projecting these points to a single dimension, the number of 1-dimensional points is at most \(nm\cdot \frac{\sqrt{d}(1+{\varepsilon })2r}{{\varepsilon }r}=O(\frac{nm\sqrt{d}}{{\varepsilon }})\). Hence, each level of the search tree on N(v) contains \(O(\frac{nm\sqrt{d}}{{\varepsilon }})\) 1-dimensional points, and the query time is \(O(d\log (\frac{nmd}{{\varepsilon }}))\).
Inserting a curve of length m into the tree \(\mathcal {T}\) takes \(O(md\log (\frac{nmd}{{\varepsilon }}))\) time. Since \(\mathcal {T}\) is a compact representation of \(|\mathcal {I}|=n\cdot O(\frac{1}{{\varepsilon }})^{dm}\) curves of length m, the number of nodes in \(\mathcal {T}\) is \(m\cdot |\mathcal {I}|=nm\cdot O(\frac{1}{{\varepsilon }})^{dm}\). Each node \(v\in \mathcal {T}\) contains a search tree for its children of size \(O(d\cdot |N(v)|)\), and \(\sum _{v\in \mathcal {T}}|N(v)|=nm\cdot O(\frac{1}{{\varepsilon }})^{dm}\), so the total space complexity is \(O(nmd)\cdot O(\frac{1}{{\varepsilon }})^{md}=n\cdot O(\frac{1}{{\varepsilon }})^{md}\). Constructing \(\mathcal {T}\) takes \(O(|\mathcal {I}|\cdot md\log (\frac{nmd}{{\varepsilon }}))=n\log (\frac{nmd}{{\varepsilon }})\cdot O(\frac{1}{{\varepsilon }})^{md}\) time.
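To make the construction concrete, the insertion and lookup procedures on \(\mathcal {T}\) can be sketched as follows. This is a minimal Python sketch with names of our own choosing; for simplicity, each node keeps its children in a single list sorted lexicographically by grid point (searched by binary search) rather than in a d-level search tree with one level per coordinate, which gives the same deterministic \(O(d\log |N(v)|)\) comparison cost per step.

```python
import bisect

class PrefixTreeNode:
    """Node of the prefix tree; children are kept in a sorted list,
    so lookups are deterministic (no hashing involved)."""
    def __init__(self):
        self.keys = []      # sorted list of child grid points (tuples)
        self.children = []  # child nodes, aligned with self.keys
        self.curve = None   # input curve C(Q-bar) stored at a leaf

    def child(self, point):
        # Binary search for an existing child labeled `point`.
        i = bisect.bisect_left(self.keys, point)
        if i < len(self.keys) and self.keys[i] == point:
            return self.children[i]
        return None

    def add_child(self, point):
        # Return the child labeled `point`, creating it if absent.
        i = bisect.bisect_left(self.keys, point)
        if i < len(self.keys) and self.keys[i] == point:
            return self.children[i]
        node = PrefixTreeNode()
        self.keys.insert(i, point)
        self.children.insert(i, node)
        return node

def insert(root, snapped_curve, original_curve):
    """Insert a snapped (grid) curve; keep the first input curve seen,
    mirroring the rule that C(Q-bar) is set only when Q-bar is new."""
    v = root
    for p in snapped_curve:
        v = v.add_child(p)
    if v.curve is None:
        v.curve = original_curve

def query(root, snapped_curve):
    """Return the curve stored for the snapped query, or None."""
    v = root
    for p in snapped_curve:
        v = v.child(p)
        if v is None:
            return None
    return v.curve
```

Each insertion or query of a curve of length m performs m descents, and each descent costs one binary search over the node's children, matching the \(O(md\log (\frac{nmd}{{\varepsilon }}))\) bound up to the per-coordinate organization of the search tree.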
Theorem 23
There exists a data structure for the \((1+{\varepsilon },r)\)-ANNC under DFD, with \(n\cdot O(\frac{1}{{\varepsilon }})^{dm}\) space, \(n\cdot \log (\frac{n}{{\varepsilon }})\cdot O(\frac{1}{{\varepsilon }})^{md}\) preprocessing time, and \(O(md\log (\frac{nmd}{{\varepsilon }}))\) query time.
Similarly, for the asymmetric case we obtain the following theorem.
Theorem 24
There exists a data structure for the asymmetric \((1+{\varepsilon },r)\)-ANNC under DFD, with \(n\cdot O(\frac{1}{{\varepsilon }})^{dk}\) space, \(nm\log (\frac{n}{{\varepsilon }})\cdot \left( O(d\log m)+O(\frac{1}{{\varepsilon }})^{kd}\right) \) preprocessing time, and \(O(kd\log (\frac{nkd}{{\varepsilon }}))\) query time.
A.2 \(\ell _{p,2}\)-Distance
For the case of ANNC under \(\ell _{p,2}\)-distance, the total number of curves stored in the tree \(\mathcal {T}\) is roughly the same as in the case of DFD. We only need to show that for a given node v of the tree \(\mathcal {T}\), the upper bound on the size and query time of the search tree associated with it are similar.
The grid points corresponding to the nodes in N(v) are from n sets of m balls with radius \((1+{\varepsilon })\). When projecting the grid points in one of the balls to a single dimension, the number of 1-dimensional points is at most \(\frac{m^{1/p}\sqrt{d}}{{\varepsilon }}\cdot (1+{\varepsilon })\), so the total number of projected points is at most \(\frac{nm^{1+\frac{1}{p}}\sqrt{d}}{{\varepsilon }}\cdot (1+{\varepsilon })\).
Thus, each level of the search tree of v contains \(O(\frac{nm^2\sqrt{d}}{{\varepsilon }})\) 1-dimensional points, the query time is \(O(d\log (\frac{nmd}{{\varepsilon }}))\), and inserting a curve of length m into the tree \(\mathcal {T}\) takes \(O(md\log (\frac{nmd}{{\varepsilon }}))\) time. Note that the size of the search tree of v remains \(O(d\cdot |N(v)|)\).
We conclude that the total space complexity is \(O(\frac{nm^2\sqrt{d}}{{\varepsilon }})\cdot O(\frac{1}{{\varepsilon }})^{m(d+1)}=n\cdot O(\frac{1}{{\varepsilon }})^{m(d+1)}\), constructing \(\mathcal {T}\) takes \(O(|\mathcal {I}|\cdot md\log (nmd/{\varepsilon }))=n\log (\frac{n}{{\varepsilon }})\cdot O(\frac{1}{{\varepsilon }})^{m(d+1)}\) time, and the total query time is \(O(md\log (\frac{nmd}{{\varepsilon }}))\).
Theorem 25
There exists a data structure for the \((1+{\varepsilon },r)\)-ANNC under \(\ell _{p,2}\)-distance, with \(n\cdot O(\frac{1}{{\varepsilon }})^{m(d+1)}\) space, \(n\cdot \log (\frac{n}{{\varepsilon }})\cdot O(\frac{1}{{\varepsilon }})^{m(d+1)}\) preprocessing time, and \(O(md\log (\frac{nmd}{{\varepsilon }}))\) query time.
Appendix B: Dealing with Query Curves and Input Curves of Varying Size
Notice that if an input curve \(C_i\) has length \(t<m\), then the size of the set of candidates \(\mathcal {I}_i\) (and \(\mathcal {I}'_i\) in the asymmetric case) can only decrease.
In addition, our assumption that all query curves are of length exactly k can be easily removed by constructing k data structures \(\mathcal {D}_1,\dots ,\mathcal {D}_k\), where \(\mathcal {D}_i\) is our data structure constructed for query curves of length i (instead of k), for \(1 \le i \le k\). Clearly, the query time does not change. The storage space is multiplied by k, so for the case of DFD we have storage space \(nk\cdot O(\frac{1}{{\varepsilon }})^{kd}\), but \(k<2^{kd}\), so the storage space remains \(n\cdot O(\frac{1}{{\varepsilon }})^{kd}\). Similarly, for the case of \(\ell _{p,2}\)-distance we obtain storage space of \(n\cdot O(\frac{1}{{\varepsilon }})^{k(d+1)}\cdot \left( \frac{m}{k}\right) ^{kd/p}\).
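The dispatching scheme above can be sketched in a few lines of Python. The `build_for_length` and `query` interfaces below are hypothetical stand-ins for the actual data structure; the sketch only illustrates routing a query to the structure \(\mathcal {D}_i\) matching its length.

```python
def build_structures(build_for_length, k):
    """Build one data structure per query length 1..k.
    `build_for_length(i)` is assumed to construct our data structure
    for query curves of length exactly i."""
    return {i: build_for_length(i) for i in range(1, k + 1)}

def answer_query(structures, q):
    """Dispatch query curve q to the structure for len(q).
    Queries longer than k are not supported and return None."""
    ds = structures.get(len(q))
    return None if ds is None else ds.query(q)
```

Since the dispatch is a single dictionary lookup by length, the query time is unchanged, and only the storage is multiplied by k, as argued above.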
Appendix C: One-Way Alignments
Claim 26
Let A, B, C be three curves, and let \(\tau _1\), \(\tau _2\) be two one-way alignments such that \(\tau _1\) matches C to A and \(\tau _2\) matches C to B. Then \(d_{p,2}(A,B)\le \sigma _{p,2}(\tau _1(C,A))+\sigma _{p,2}(\tau _2(C,B))\).
Proof
Denote by \(k_A,k_B,k_C\) the lengths of the curves A, B, C respectively. Consider the following algorithm that constructs an alignment \(\tau \). For every \(1\le x\le k_C\), denote by \(i_x,j_x\) the unique indices such that \((x,i_x)\in \tau _1\) and \((x,j_x)\in \tau _2\). Add the pair \((i_x,j_x)\) to \(\tau \) if it is not already there.
First, we need to show that \(\tau =\langle (i_1,j_1),\dots ,(i_t,j_t)\rangle \) is a valid alignment. Clearly, \((i_1,j_1)=(1,1)\) because \((1,1)\in \tau _1\) and \((1,1)\in \tau _2\). Similarly, \((i_t,j_t)=(k_A,k_B)\) because \((k_C,k_A)\in \tau _1\) and \((k_C,k_B)\in \tau _2\).
For any \(1\le s<t\), consider the two consecutive pairs \((i_s,j_s),(i_{s+1},j_{s+1})\in \tau \). Let \(x_1\) be an index such that \((x_1,i_s)\in \tau _1\) and \((x_1,j_s)\in \tau _2\), and let \(x_2\) be an index such that \((x_2,i_{s+1})\in \tau _1\) and \((x_2,j_{s+1})\in \tau _2\). Since \(\tau _1,\tau _2\) are one-way alignments, we have \(x_1\ne x_2\). Moreover, since the algorithm added \((i_s,j_s)\) to \(\tau \) before \((i_{s+1},j_{s+1})\), we have \(x_1<x_2\). This implies that \(i_{s+1}\ge i_s\) and \(j_{s+1}\ge j_s\). Assume, for the sake of contradiction, that \(i_{s+1} > i_s+1\), and let x be the index such that \((x,i_s+1)\in \tau _1\); then \(x_1<x<x_2\), and thus the algorithm adds a pair \((i_s+1,j)\) for some index j after \((i_s,j_s)\) and before \((i_{s+1},j_{s+1})\), a contradiction. So we have \(i_s\le i_{s+1} \le i_s+1\), and by symmetric arguments, \(j_s\le j_{s+1} \le j_s+1\), and therefore \(\tau \) is valid.
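The merging procedure from this proof translates directly into code. Below is a short Python sketch of our own (names are illustrative), with a one-way alignment represented as a list of pairs (x, i) containing exactly one pair per index x of C.

```python
def merge_alignments(tau1, tau2):
    """Combine two one-way alignments tau1 (matching C to A) and
    tau2 (matching C to B) into an alignment tau between A and B:
    for each index x of C, pair up i_x with j_x, skipping pairs
    already added (consecutive duplicates)."""
    to_a = dict(tau1)  # x -> i_x (unique per x in a one-way alignment)
    to_b = dict(tau2)  # x -> j_x
    tau = []
    for x in sorted(to_a):
        pair = (to_a[x], to_b[x])
        if not tau or tau[-1] != pair:
            tau.append(pair)
    return tau
```

By the argument above, consecutive pairs of the returned list differ by at most 1 in each coordinate, so the result is a valid alignment between A and B.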
Using the triangle inequality for the \(\ell _p\) norm, we get that
$$\begin{aligned} d_{p,2}(A,B)&\le \left( \sum _{s=1}^{t}\Vert a_{i_s}-b_{j_s}\Vert _2^p\right) ^{1/p} \le \left( \sum _{s=1}^{t}\big (\Vert a_{i_s}-c_{x_s}\Vert _2+\Vert c_{x_s}-b_{j_s}\Vert _2\big )^p\right) ^{1/p}\\ &\le \left( \sum _{s=1}^{t}\Vert a_{i_s}-c_{x_s}\Vert _2^p\right) ^{1/p}+\left( \sum _{s=1}^{t}\Vert c_{x_s}-b_{j_s}\Vert _2^p\right) ^{1/p} \le \sigma _{p,2}(\tau _1(C,A))+\sigma _{p,2}(\tau _2(C,B)), \end{aligned}$$
where for each \(1\le s\le t\), \(x_s\) is an index with \((x_s,i_s)\in \tau _1\) and \((x_s,j_s)\in \tau _2\); the last inequality holds since the pairs \((x_s,i_s)\) form a subset of \(\tau _1\) and the pairs \((x_s,j_s)\) form a subset of \(\tau _2\).
\(\square \)
Filtser, A., Filtser, O. & Katz, M.J. Approximate Nearest Neighbor for Curves: Simple, Efficient, and Deterministic. Algorithmica 85, 1490–1519 (2023). https://doi.org/10.1007/s00453-022-01080-1