Approximate Nearest Neighbor for Curves: Simple, Efficient, and Deterministic

Algorithmica

Abstract

In the \((1+{\varepsilon },r)\)-approximate near-neighbor problem for curves (ANNC) under some similarity measure \(\delta \), the goal is to construct a data structure for a given set \(\mathcal {C}\) of curves that supports approximate near-neighbor queries: Given a query curve Q, if there exists a curve \(C\in \mathcal {C}\) such that \(\delta (Q,C)\le r\), then return a curve \(C'\in \mathcal {C}\) with \(\delta (Q,C')\le (1+{\varepsilon })r\). There exists an efficient reduction from the \((1+{\varepsilon })\)-approximate nearest-neighbor problem to ANNC, where in the former problem the answer to a query is a curve \(C\in \mathcal {C}\) with \(\delta (Q,C)\le (1+{\varepsilon })\cdot \delta (Q,C^*)\), where \(C^*\) is the curve of \(\mathcal {C}\) most similar to Q. Given a set \(\mathcal {C}\) of n curves, each consisting of m points in d dimensions, we construct a data structure for ANNC that uses \(n\cdot O(\frac{1}{{\varepsilon }})^{md}\) storage space and has O(md) query time (for a query curve of length m), where the similarity measure between two curves is their discrete Fréchet or dynamic time warping distance. Our method is simple to implement, deterministic, and results in an exponential improvement in both query time and storage space compared to all previous bounds. Further, we also consider the asymmetric version of ANNC, where the length of the query curves is \(k \ll m\), and obtain essentially the same storage and query bounds as above, except that m is replaced by k. Finally, we apply our method to a version of approximate range counting for curves and achieve similar bounds.
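The query contract stated above can be made concrete with a naive baseline. The following is a minimal sketch, assuming curves are sequences of points and taking the discrete Fréchet distance as the similarity measure; it is a linear scan used only to illustrate the \((1+{\varepsilon },r)\) guarantee, not the data structure of this paper. The names `discrete_frechet` and `near_neighbor_scan` are illustrative.

```python
import math
from typing import Optional, Sequence

Point = Sequence[float]
Curve = Sequence[Point]


def discrete_frechet(P: Curve, Q: Curve) -> float:
    """Discrete Frechet distance via the standard O(|P||Q|) dynamic program."""
    prev = [0.0] * len(Q)
    for i in range(len(P)):
        cur = [0.0] * len(Q)
        for j in range(len(Q)):
            d = math.dist(P[i], Q[j])
            if i == 0 and j == 0:
                best = 0.0
            elif i == 0:
                best = cur[j - 1]          # only Q has advanced
            elif j == 0:
                best = prev[j]             # only P has advanced
            else:
                best = min(prev[j], cur[j - 1], prev[j - 1])
            cur[j] = max(best, d)          # bottleneck cost of the alignment
        prev = cur
    return prev[-1]


def near_neighbor_scan(curves: Sequence[Curve], query: Curve,
                       r: float, eps: float) -> Optional[int]:
    """Return the index of some curve within (1+eps)*r of the query, else None.

    If a curve within distance r exists, it passes the threshold, so the
    scan is guaranteed to return an answer in that case.
    """
    for idx, curve in enumerate(curves):
        if discrete_frechet(query, curve) <= (1.0 + eps) * r:
            return idx
    return None
```

The scan costs \(O(nm^2d)\) per query, which is the cost the data structure described in the abstract avoids.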

Notes

  1. Since our storage space is already in \(O(\frac{1}{{\varepsilon }})^{md}\), and \(m\cdot 2^{2m}\le 3^{2m}\) is in \(O(1)^{md}\), we could have used this larger upper bound. However, in Lemma 4 we show a tight upper bound on the number of relevant alignments, which may be useful for other applications.

  2. See [5] for a closely related more recent result on simplifications with bounded length.

References

  1. Afshani, P., Driemel, A.: On the complexity of range searching among curves. In: Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pp 898–917, (2018), https://doi.org/10.1137/1.9781611975031.58

  2. Aronov, B., Filtser, O., Horton, M., Katz, M.J., Sheikhan, K.: Efficient nearest-neighbor query and clustering of planar curves. In: Algorithms and Data Structures—16th International Symposium, WADS 2019, Edmonton, AB, Canada, August 5–7, 2019, Proceedings, pp 28–42 (2019), https://doi.org/10.1007/978-3-030-24766-9_3

  3. Buchin, K., Driemel, A., Gudmundsson, J., Horton, M., Kostitsyna, I., Löffler, M., Struijs, M.: Approximating (k, l)-center clustering for curves. In: Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pp 2922–2938, (2019), https://doi.org/10.1137/1.9781611975482.181

  4. Bringmann, K., Driemel, A., Nusser, A., Psarros, I.: Tight bounds for approximate near neighbor searching for time series under the Fréchet distance. In: Symposium on Discrete Algorithms, SODA (2022)

  5. Buchin, M., Driemel, A., van Greevenbroek, K., Psarros, I., Rohde, D.: Approximating length-restricted means under dynamic time warping. In: Approximation and Online Algorithms—20th International Workshop, WAOA, volume 13538, pp 225–253, (2022), https://doi.org/10.1007/978-3-031-18367-6_12

  6. Bereg, S., Jiang, M., Wang, W., Yang, B., Zhu, B.: Simplifying 3D polygonal chains under the discrete Fréchet distance. In LATIN 2008: Theoretical Informatics, 8th Latin American Symposium, Búzios, Brazil, April 7-11, 2008, Proceedings, pp 630–641, (2008), https://doi.org/10.1007/978-3-540-78773-0_54

  7. Bringmann, K.: Why walking the dog takes time: Fréchet distance has no strongly subquadratic algorithms unless SETH fails. In: 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, Philadelphia, PA, USA, October 18-21, 2014, pp 661–670, 2014, https://doi.org/10.1109/FOCS.2014.76

  8. de Berg, M., Cook, A.F., IV., Gudmundsson, J.: Fast Fréchet queries. Comput. Geom. 46(6), 747–755 (2013). https://doi.org/10.1016/j.comgeo.2012.11.006

  9. de Berg, M., Gudmundsson, J., Mehrabi, A. D.: A dynamic data structure for approximate proximity queries in trajectory data. In: Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS 2017, Redondo Beach, CA, USA, November 7–10, 2017, pp 48:1–48:4, (2017), https://doi.org/10.1145/3139958.3140023

  10. Driemel, A., Har-Peled, S.: Jaywalking your dog: Computing the Fréchet distance with shortcuts. SIAM J. Comput. 42(5), 1830–1866 (2013). https://doi.org/10.1137/120865112

  11. Driemel, A., Psarros, I.: ANN for time series under the Fréchet distance. In: A. Lubiw and M. R. Salavatipour, editors, Algorithms and Data Structures—17th International Symposium, WADS 2021, Virtual Event, August 9-11, 2021, Proceedings, volume 12808 of Lecture Notes in Computer Science, pp 315–328. Springer, (2021), https://doi.org/10.1007/978-3-030-83508-8_23

  12. Driemel, A., Psarros, I., Schmidt, M.: Sublinear data structures for short Fréchet queries. CoRR, abs/1907.04420, 2019, arXiv:1907.04420

  13. Driemel, A., Silvestri, F.: Locality-sensitive hashing of curves. In Proceedings of the 33rd International Symposium on Computational Geometry, volume 77, pp 37:1–37:16, Brisbane, Australia, July 2017. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, https://doi.org/10.4230/LIPIcs.SoCG.2017.37

  14. Emiris, I.Z., Psarros, I.: Products of Euclidean metrics, applied to proximity problems among curves: unified treatment of discrete Fréchet and dynamic time warping distances. ACM Trans. Spatial Algorithms Syst. 6(4), 27:1-27:20 (2020). https://doi.org/10.1145/3397518

  15. Filtser, A., Filtser, O., Katz, M. J.: Approximate nearest neighbor for curves—simple, efficient, and deterministic. In: A. Czumaj, A. Dawar, and E. Merelli, editors, 47th International Colloquium on Automata, Languages, and Programming, ICALP 2020, July 8-11, 2020, Saarbrücken, Germany (Virtual Conference), volume 168 of LIPIcs, pages 48:1–48:19. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020, https://doi.org/10.4230/LIPIcs.ICALP.2020.48

  16. Har-Peled, S., Indyk, P., Motwani, R.: Approximate nearest neighbor: towards removing the curse of dimensionality. Theory Comput. 8(1), 321–350 (2012). https://doi.org/10.4086/toc.2012.v008a014

  17. Har-Peled, S., Kumar, N.: Approximate nearest neighbor search for low dimensional queries. In: D. Randall, editor, Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2011, San Francisco, California, USA, January 23–25, 2011, pp 854–867. SIAM, 2011, https://doi.org/10.1137/1.9781611973082.67

  18. Indyk, P.: High-dimensional computational geometry. PhD thesis, Stanford University, 2000

  19. Indyk, P.: Approximate nearest neighbor algorithms for Fréchet distance via product metrics. In: Proceedings of the 8th Symposium on Computational Geometry, pp 102–106, Barcelona, Spain, June 2002. ACM Press, https://doi.org/10.1145/513400.513414

  20. Kumar, P., Mitchell, J. S. B., Yildirim, E. A.: Computing core-sets and approximate smallest enclosing hyperspheres in high dimensions. In: Proceedings of the Fifth Workshop on Algorithm Engineering and Experiments, Baltimore, MD, USA, January 11, 2003, pp 45–55, (2003), https://doi.org/10.1145/996546.996548

  21. Lemire, D.: Faster retrieval with a two-pass dynamic-time-warping lower bound. Pattern Recogn. 42(9), 2169–2180 (2009). https://doi.org/10.1016/j.patcog.2008.11.030

  22. Megiddo, N.: Linear programming in linear time when the dimension is fixed. J. ACM 31(1), 114–127 (1984). https://doi.org/10.1145/2422.322418

  23. Pagh, R., Rodler, F.F.: Cuckoo hashing. J. Algorithms 51(2), 122–144 (2004). https://doi.org/10.1016/j.jalgor.2003.12.002

  24. Shakhnarovich, G., Darrell, T., Indyk, P.: Nearest-Neighbor Methods in Learning and Vision: Theory and Practice (Neural Information Processing). The MIT Press, Cambridge (2006)

Acknowledgements

We wish to thank Boris Aronov for helpful discussions on the problems studied in this paper.

Funding

Arnold Filtser was partially supported by Grant 1042/22 from the Israel Science Foundation. Omrit Filtser was supported by the Eric and Wendy Schmidt Fund for Strategic Innovation, by the Council for Higher Education of Israel, and by Ben-Gurion University of the Negev. Matthew J. Katz was partially supported by Grant 1884/16 from the Israel Science Foundation.

Author information

Corresponding author

Correspondence to Omrit Filtser.

Additional information

A preliminary version of this paper excluding Sect. 6 and Appendices B and C has appeared in ICALP’20 [15].

Appendices

Appendix A: Deterministic Construction Using a Prefix Tree

When the dictionary \(\mathcal {D}\) is implemented as a hash table, the construction of the data structure is randomized, and thus in the worst case the preprocessing time might be higher. To avoid this, we can implement \(\mathcal {D}\) as a prefix tree.

A.1 Discrete Fréchet Distance

In this section we describe the implementation of \(\mathcal {D}\) as a prefix tree in the case of ANNC under DFD.

We can construct a prefix tree \(\mathcal {T}\) for the curves in \(\mathcal {I}\), where any path in \(\mathcal {T}\) from the root to a leaf corresponds to a curve that is stored in it. For each \(1\le i\le n\) and curve \({\overline{Q}}\in \mathcal {I}_i\), if \({\overline{Q}}\notin \mathcal {T}\), insert \({\overline{Q}}\) into \(\mathcal {T}\), and set \(C({\overline{Q}})\leftarrow C_i\).

Each node \(v\in \mathcal {T}\) corresponds to a grid point from \(\mathcal {G}\). Denote the set of v’s children by N(v). We store with v a multilevel search tree on N(v), with a level for each coordinate. The points in \(\mathcal {G}\) are the grid points contained in nm balls of radius \((1+{\varepsilon })r\). Thus when projecting these points to a single dimension, the number of 1-dimensional points is at most \(nm\cdot \frac{\sqrt{d}(1+{\varepsilon })2r}{{\varepsilon }r}=O(\frac{nm\sqrt{d}}{{\varepsilon }})\). So in each level of the search tree on N(v) we have \(O(\frac{nm\sqrt{d}}{{\varepsilon }})\) 1-dimensional points, so the query time is \(O(d\log (\frac{nmd}{{\varepsilon }}))\).
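To unpack the projection bound above: the fraction suggests a grid of edge length \({\varepsilon }r/\sqrt{d}\) (an assumption read off from the formula, since the grid is defined outside this appendix), and each ball has diameter \(2(1+{\varepsilon })r\). Assuming also \({\varepsilon }\le 1\), the count and the resulting search cost at a single node work out as

$$\begin{aligned} nm\cdot \frac{2(1+{\varepsilon })r}{{\varepsilon }r/\sqrt{d}}&=\frac{2nm\sqrt{d}(1+{\varepsilon })}{{\varepsilon }}\le \frac{4nm\sqrt{d}}{{\varepsilon }}=O\Big (\frac{nm\sqrt{d}}{{\varepsilon }}\Big ),\\ d\cdot O\Big (\log \frac{nm\sqrt{d}}{{\varepsilon }}\Big )&=O\Big (d\log \frac{nmd}{{\varepsilon }}\Big ). \end{aligned}$$

The second line charges one binary search per coordinate level, with d levels in total.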

Inserting a curve of length m to the tree \(\mathcal {T}\) takes \(O(md\log (\frac{nmd}{{\varepsilon }}))\) time. Since \(\mathcal {T}\) is a compact representation of \(|\mathcal {I}|=n\cdot O(\frac{1}{{\varepsilon }})^{dm}\) curves of length m, the number of nodes in \(\mathcal {T}\) is \(m\cdot |\mathcal {I}|=nm\cdot O(\frac{1}{{\varepsilon }})^{dm}\). Each node \(v\in \mathcal {T}\) contains a search tree for its children of size \(O(d\cdot |N(v)|)\), and \(\sum _{v\in \mathcal {T}}|N(v)|=nm\cdot O(\frac{1}{{\varepsilon }})^{dm}\) so the total space complexity is \(O(nmd)\cdot O(\frac{1}{{\varepsilon }})^{md}=n\cdot O(\frac{1}{{\varepsilon }})^{md}\). Constructing \(\mathcal {T}\) takes \(O(|\mathcal {I}|\cdot md\log (\frac{nmd}{{\varepsilon }}))=n\log (\frac{nmd}{{\varepsilon }})\cdot O(\frac{1}{{\varepsilon }})^{md}\) time.
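The prefix tree described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: candidate curves are assumed to be already snapped to integer grid coordinates, and a lexicographically sorted child list searched with `bisect` stands in for the multilevel search tree on N(v). Lookups are deterministic and logarithmic in |N(v)|, but insertion into a Python list is linear, so the preprocessing bound of the text is not reproduced. The names `TrieNode`, `CurveTrie`, and `answer` are illustrative.

```python
import bisect
from typing import Hashable, List, Optional, Sequence, Tuple

GridPoint = Tuple[int, ...]        # integer grid coordinates of one vertex
GridCurve = Sequence[GridPoint]    # a candidate curve over grid points


class TrieNode:
    """A node of the prefix tree; each child edge is labelled by a grid point."""

    def __init__(self) -> None:
        self.keys: List[GridPoint] = []     # child labels, sorted lexicographically
        self.kids: List["TrieNode"] = []    # kids[i] is the child labelled keys[i]
        self.answer: Optional[Hashable] = None   # plays the role of C(Q-bar)

    def child(self, point: GridPoint, create: bool = False) -> Optional["TrieNode"]:
        i = bisect.bisect_left(self.keys, point)   # deterministic binary search
        if i < len(self.keys) and self.keys[i] == point:
            return self.kids[i]
        if not create:
            return None
        node = TrieNode()
        self.keys.insert(i, point)
        self.kids.insert(i, node)
        return node


class CurveTrie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, candidate: GridCurve, input_curve_id: Hashable) -> None:
        """Insert a candidate curve; an existing entry keeps its first answer."""
        node = self.root
        for point in candidate:
            node = node.child(point, create=True)
        if node.answer is None:
            node.answer = input_curve_id

    def lookup(self, candidate: GridCurve) -> Optional[Hashable]:
        """Walk one root-to-leaf path; return the stored curve id or None."""
        node = self.root
        for point in candidate:
            node = node.child(point)
            if node is None:
                return None
        return node.answer
```

A query walks m edges and performs one child search per edge, which mirrors the \(O(md\log (\frac{nmd}{{\varepsilon }}))\) query bound when each child search costs \(O(d\log (\frac{nmd}{{\varepsilon }}))\).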

Theorem 23

There exists a data structure for the \((1+{\varepsilon },r)\)-ANNC under DFD, with \(n\cdot O(\frac{1}{{\varepsilon }})^{dm}\) space, \(n\cdot \log (\frac{n}{{\varepsilon }})\cdot O(\frac{1}{{\varepsilon }})^{md}\) preprocessing time, and \(O(md\log (\frac{nmd}{{\varepsilon }}))\) query time.

Similarly, for the asymmetric case we obtain the following theorem.

Theorem 24

There exists a data structure for the asymmetric \((1+{\varepsilon },r)\)-ANNC under DFD, with \(n\cdot O(\frac{1}{{\varepsilon }})^{dk}\) space, \(nm\log (\frac{n}{{\varepsilon }})\cdot \left( O(d\log m)+O(\frac{1}{{\varepsilon }})^{kd}\right) \) preprocessing time, and \(O(kd\log (\frac{nkd}{{\varepsilon }}))\) query time.

A.2 \(\ell _{p,2}\)-Distance

For the case of ANNC under \(\ell _{p,2}\)-distance, the total number of curves stored in the tree \(\mathcal {T}\) is roughly the same as in the case of DFD. We only need to show that for a given node v of the tree \(\mathcal {T}\), the upper bound on the size and query time of the search tree associated with it are similar.

The grid points corresponding to the nodes in N(v) are from n sets of m balls with radius \((1+{\varepsilon })\). When projecting the grid points in one of the balls to a single dimension, the number of 1-dimensional points is at most \(\frac{m^{1/p}\sqrt{d}}{{\varepsilon }}\cdot (1+{\varepsilon })\), so the total number of projected points is at most \(\frac{nm^{1+\frac{1}{p}}\sqrt{d}}{{\varepsilon }}\cdot (1+{\varepsilon })\).

Thus in each level of the search tree of v we have \(O(\frac{nm^2\sqrt{d}}{{\varepsilon }})\) 1-dimensional points, so the query time is \(O(d\log (\frac{nmd}{{\varepsilon }}))\), and inserting a curve of length m into the tree \(\mathcal {T}\) takes \(O(md\log (\frac{nmd}{{\varepsilon }}))\) time. Note that the size of the search tree of v remains \(O(d\cdot |N(v)|)\).

We conclude that the total space complexity is \(O(\frac{nm^2\sqrt{d}}{{\varepsilon }})\cdot O(\frac{1}{{\varepsilon }})^{m(d+1)}=n\cdot O(\frac{1}{{\varepsilon }})^{m(d+1)}\), constructing \(\mathcal {T}\) takes \(O(|\mathcal {I}|\cdot md\log (nmd/{\varepsilon }))=n\log (\frac{n}{{\varepsilon }})\cdot O(\frac{1}{{\varepsilon }})^{m(d+1)}\) time, and the total query time is \(O(md\log (\frac{nmd}{{\varepsilon }}))\).

Theorem 25

There exists a data structure for the \((1+{\varepsilon },r)\)-ANNC under \(\ell _{p,2}\)-distance, with \(n\cdot O(\frac{1}{{\varepsilon }})^{m(d+1)}\) space, \(n\cdot \log (\frac{n}{{\varepsilon }})\cdot O(\frac{1}{{\varepsilon }})^{m(d+1)}\) preprocessing time, and \(O(md\log (\frac{nmd}{{\varepsilon }}))\) query time.

Appendix B: Dealing with Query Curves and Input Curves of Varying Size

Notice that if an input curve \(C_i\) has length \(t<m\), then the size of the set of candidates \(\mathcal {I}_i\) (and \(\mathcal {I}'_i\) in the asymmetric case) can only decrease.

In addition, our assumption that all query curves are of length exactly k can be easily removed by constructing k data structures \(\mathcal {D}_1,\dots ,\mathcal {D}_k\), where \(\mathcal {D}_i\) is our data structure constructed for query curves of length i (instead of k), for \(1 \le i \le k\). Clearly, the query time does not change. The storage space is multiplied by k, so for the case of DFD we have storage space \(nk\cdot O(\frac{1}{{\varepsilon }})^{kd}\), but \(k<2^{kd}\), so the storage space remains \(n\cdot O(\frac{1}{{\varepsilon }})^{kd}\). Similarly, for the case of \(\ell _{p,2}\)-distance we obtain storage space of \(n\cdot O(\frac{1}{{\varepsilon }})^{k(d+1)}\cdot \left( \frac{m}{k}\right) ^{kd/p}\).
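The reduction to k per-length structures amounts to a dispatch on the query length. A minimal sketch, assuming a hypothetical factory `build_for_length(i)` that returns the query procedure of \(\mathcal {D}_i\) (neither the factory nor the class name comes from the paper):

```python
from typing import Callable, Hashable, List, Optional, Sequence

Curve = Sequence[Sequence[float]]
QueryFn = Callable[[Curve], Optional[Hashable]]


class VaryingLengthIndex:
    """Holds D_1, ..., D_k and routes each query to the structure for its length."""

    def __init__(self, k: int, build_for_length: Callable[[int], QueryFn]) -> None:
        # build_for_length is a hypothetical factory: build_for_length(i)
        # returns the query procedure of D_i, built for queries of length i.
        self.k = k
        self.structures: List[QueryFn] = [
            build_for_length(i) for i in range(1, k + 1)
        ]

    def query(self, curve: Curve) -> Optional[Hashable]:
        length = len(curve)
        if not 1 <= length <= self.k:
            raise ValueError(f"query length {length} is outside 1..{self.k}")
        # One array access selects D_length, so the query time is unchanged.
        return self.structures[length - 1](curve)
```

Only one of the k structures is touched per query, which is why the query time stays the same while the storage grows by a factor of k.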

Appendix C: One-Way Alignments

Claim 26

Let A, B, C be three curves, and let \(\tau _1\), \(\tau _2\) be two one-way alignments such that \(\tau _1\) matches C to A and \(\tau _2\) matches C to B. Then \(d_{p,2}(A,B)\le \sigma _{p,2}(\tau _1(C,A))+\sigma _{p,2}(\tau _2(C,B))\).

Proof

Denote by \(k_A,k_B,k_C\) the lengths of the curves A, B, C, respectively. Consider the following algorithm that constructs an alignment \(\tau \). For every \(1\le x\le k_C\), in increasing order, denote by \(i_x,j_x\) the unique indices such that \((x,i_x)\in \tau _1\) and \((x,j_x)\in \tau _2\). Add the pair \((i_x,j_x)\) to \(\tau \) if it is not already there.

First, we need to show that \(\tau =\langle (i_1,j_1),\dots ,(i_t,j_t)\rangle \) is a valid alignment. Clearly, \((i_1,j_1)=(1,1)\) because \((1,1)\in \tau _1\) and \((1,1)\in \tau _2\). Similarly, \((i_t,j_t)=(k_A,k_B)\) because \((k_C,k_A)\in \tau _1\) and \((k_C,k_B)\in \tau _2\).

For any \(1\le s<t\), consider the two consecutive pairs \((i_s,j_s),(i_{s+1},j_{s+1})\in \tau \). Let \(x_1\) be an index such that \((x_1,i_s)\in \tau _1\) and \((x_1,j_s)\in \tau _2\), and \(x_2\) an index such that \((x_2,i_{s+1})\in \tau _1\) and \((x_2,j_{s+1})\in \tau _2\). Since \(\tau _1,\tau _2\) are one-way alignments, we have \(x_1\ne x_2\). Moreover, since the algorithm added \((i_s,j_s)\) to \(\tau \) before \((i_{s+1},j_{s+1})\), we have \(x_1<x_2\). This implies that \(i_{s+1}\ge i_s\) and \(j_{s+1}\ge j_s\). Assume by contradiction that \(i_{s+1} > i_s+1\), and let x be the index such that \((x,i_s+1)\in \tau _1\), then \(x_1<x<x_2\) and thus the algorithm adds a pair \((i_s+1,j)\) for some index j after \((i_s,j_s)\) and before \((i_{s+1},j_{s+1})\), a contradiction. So we have \(i_s\le i_{s+1} \le i_s+1\), and by symmetric arguments, \(j_s\le j_{s+1} \le j_s+1\), and therefore \(\tau \) is valid.

Using the triangle inequality for the \(\ell _p\) norm, we get that

$$\begin{aligned} d_{p,2}(A,B)\le \sigma _{p,2}(\tau (A,B))&=\Big (\sum _{(i,j)\in \tau }\Vert a_{i}-b_{j}\Vert _{2}^{p}\Big )^{1/p}\\&\le \Big (\sum _{x=1}^{k_{C}}\Vert a_{i_{x}}-b_{j_{x}}\Vert _{2}^{p}\Big )^{1/p}\\&\le \Big (\sum _{x=1}^{k_{C}}\Vert a_{i_{x}}-c_{x}\Vert _{2}^{p}\Big )^{1/p}+\Big (\sum _{x=1}^{k_{C}}\Vert c_{x}-b_{j_{x}}\Vert _{2}^{p}\Big )^{1/p}\\&=\sigma _{p,2}(\tau _{1}(C,A))+\sigma _{p,2}(\tau _{2}(C,B))~. \end{aligned}$$

\(\square \)
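The construction in the proof can be checked on small inputs. The sketch below assumes 0-based indices and represents a one-way alignment from C as a list whose entry x is the index matched to \(c_x\); this encoding and the function names are illustrative choices, not notation from the paper.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]
Curve = Sequence[Point]
Alignment = List[Tuple[int, int]]


def compose_one_way(tau1: Sequence[int], tau2: Sequence[int]) -> Alignment:
    """Build tau from tau1[x] = i_x and tau2[x] = j_x, scanning x in order."""
    if len(tau1) != len(tau2):
        raise ValueError("both alignments must be defined on the same curve C")
    tau: Alignment = []
    for pair in zip(tau1, tau2):
        # Both index sequences are non-decreasing, so a repeated pair can
        # only repeat the most recently added one.
        if not tau or tau[-1] != pair:
            tau.append(pair)
    return tau


def is_valid_alignment(tau: Alignment, k_a: int, k_b: int) -> bool:
    """Check the endpoints and that every step advances by 0 or 1 per curve."""
    if not tau or tau[0] != (0, 0) or tau[-1] != (k_a - 1, k_b - 1):
        return False
    steps = {(0, 1), (1, 0), (1, 1)}
    return all((i2 - i1, j2 - j1) in steps
               for (i1, j1), (i2, j2) in zip(tau, tau[1:]))


def alignment_cost(tau: Alignment, X: Curve, Y: Curve, p: float) -> float:
    """sigma_{p,2}: the l_p norm of the Euclidean distances of matched pairs."""
    return sum(math.dist(X[i], Y[j]) ** p for i, j in tau) ** (1.0 / p)


def one_way_cost(tau_x: Sequence[int], C: Curve, X: Curve, p: float) -> float:
    """Cost of the one-way alignment that matches c_x to X[tau_x[x]]."""
    return alignment_cost(list(enumerate(tau_x)), C, X, p)
```

For example, with \(k_C=4\), `tau1 = [0, 0, 1, 1]` and `tau2 = [0, 1, 1, 2]`, the composed alignment has four pairs and passes `is_valid_alignment(tau, 2, 3)`; comparing `alignment_cost` against the sum of the two `one_way_cost` values then illustrates the chain of inequalities in the proof.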

Cite this article

Filtser, A., Filtser, O. & Katz, M.J. Approximate Nearest Neighbor for Curves: Simple, Efficient, and Deterministic. Algorithmica 85, 1490–1519 (2023). https://doi.org/10.1007/s00453-022-01080-1
