The VC Dimension of Metric Balls under Fréchet and Hausdorff Distances

The Vapnik–Chervonenkis dimension provides a notion of complexity for systems of sets. If the VC dimension is small, then knowing this can drastically simplify fundamental computational tasks such as classification, range counting, and density estimation through the use of sampling bounds. We analyze set systems where the ground set X is a set of polygonal curves in $\mathbb{R}^d$ and the sets $\mathcal{R}$ are metric balls defined by curve similarity metrics, such as the Fréchet distance and the Hausdorff distance, as well as their discrete counterparts. We derive upper and lower bounds on the VC dimension that imply useful sampling bounds in the setting where the number of curves is large, but the complexity of the individual curves is small. Our upper and lower bounds are either near-quadratic or near-linear in the complexity of the curves that define the ranges and they are logarithmic in the complexity of the curves that define the ground set.


Introduction
A range space (X , R) (also called set system) is defined by a ground set X and a set of ranges R, where each r ∈ R is a subset of X . A data structure for range searching answers queries for the subset of the input data that lies inside the query range. In range counting, we are interested only in the size of this subset. In our setting, a range is a metric ball defined by a curve and a radius. The ball contains all curves that lie within this radius from the center under a specific distance function (e.g., Fréchet or Hausdorff distance).
A crucial descriptor of any range space is its VC dimension [41,43,46] and the related shattering dimension, which we define formally below. These notions quantify how complex a range space is, and have played fundamental roles in machine learning [7,45], data structures [17], and geometry [14,30]. For instance, specific bounds on these complexity parameters are critical for tasks as diverse as neural networks [7,36], art gallery problems [26,37,44], and kernel density estimation [35].
The last five years have seen a surge of interest in data structures for trajectory processing under the Fréchet distance, manifested in a series of publications [2,8,9,10,11,15,21,22,24,29,47]. This was partially motivated by the increasing availability and quality of trajectory data from mobile phones, GPS sensors, RFID technology, and video analysis [28,38,48]. Initial results in this line of research, such as the approximate range counting data structure by de Berg et al. [10], use classical data structuring techniques. Afshani and Driemel extended their results and in addition showed lower bounds on the space-query-time trade-off in this setting [2]. In particular, they showed a lower bound which is exponential in the complexity of the curves for exact range searching. In 2017, ACM SIGSPATIAL, the premier conference for geographic information science, devoted its software challenge (GIS CUP) to the problem of range searching under the Fréchet distance [47]. This spurred further developments; the most recent results explore the use of heuristics [13] and randomization [16].
The Fréchet distance, named after Maurice Fréchet [25], is a popular distance measure for curves. Intuitively, it can be defined using the metaphor of a person walking a dog, where the person follows one curve and the dog follows the other curve, and throughout their traversal they are connected by a leash of fixed length. Both can vary their speed but they are not allowed to move backwards. The Fréchet distance corresponds to the length of the shortest dog leash that permits a traversal in this fashion. The Fréchet distance is very similar to the Hausdorff distance for sets [32], which is defined as the minimal maximum distance of a pair of points, one from each set, under all possible mappings between the two sets. The difference between the two distance measures is that the Fréchet distance requires the mapping to adhere to the ordering of the points along the curve. Both distance measures allow flexible associations between parts of the input elements, which sets them apart from classical $L_p$ distances and makes them suitable for trajectory data under varying speeds. One standard tool for computing the Fréchet distance of two curves is the free-space diagram, which was introduced by Alt and Godau [6]. In the free-space diagram, we consider the polygonal curves as continuous curves $[0, 1] \to \mathbb{R}^d$. The free-space for a given distance threshold ρ is a subset of the parametric space [0, 1] × [0, 1] that consists of all point pairs on the two curves at distance at most ρ. The vertices of the
2. $\forall (i_k, j_k) \in T$: $i_{k+1} - i_k \in \{0, 1\}$ and $j_{k+1} - j_k \in \{0, 1\}$;
3. $\forall (i_k, j_k) \in T$: $(i_{k+1} - i_k) + (j_{k+1} - j_k) \ge 1$.

Definition 2.4 (discrete Fréchet distance)
Given polygonal curves V and U with vertices $v_1, \dots, v_{m_1}$ and $u_1, \dots, u_{m_2}$ respectively, we define the discrete Fréchet distance between V and U as the following function:
$$d_{dF}(V, U) = \min_{T \in \mathcal{T}} \max_{(i_k, j_k) \in T} \|v_{i_k} - u_{j_k}\|,$$
where $\mathcal{T}$ denotes the set of all possible traversals for V and U.
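The discrete Fréchet distance is usually evaluated with a dynamic program over pairs of vertex indices rather than by enumerating traversals. The following is a minimal sketch under that assumption; the function name `discrete_frechet` and the representation of curves as lists of coordinate tuples are illustrative choices, not notation from this paper.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def discrete_frechet(V, U):
    """Discrete Frechet distance between two vertex sequences.

    V and U are non-empty lists of coordinate tuples (hypothetical input format).
    """
    m1, m2 = len(V), len(U)
    # D[i][j] = smallest leash length of any traversal of V[:i+1] and U[:j+1]
    D = [[0.0] * m2 for _ in range(m1)]
    for i in range(m1):
        for j in range(m2):
            d = dist(V[i], U[j])
            if i == 0 and j == 0:
                D[i][j] = d
            elif i == 0:
                D[i][j] = max(D[0][j - 1], d)
            elif j == 0:
                D[i][j] = max(D[i - 1][0], d)
            else:
                # a traversal step advances on V, on U, or on both
                best_prev = min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
                D[i][j] = max(best_prev, d)
    return D[-1][-1]
```

The table has one entry per vertex pair, so the sketch runs in time proportional to the product of the two curve complexities.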

Range Spaces
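Each range space can be defined as a pair of sets $(X, \mathcal{R})$, where X is the ground set and $\mathcal{R} \subseteq 2^X$ is the range set. Let $(X, \mathcal{R})$ be a range space. For $Y \subseteq X$, we denote
$$\mathcal{R}_{|Y} = \{R \cap Y \mid R \in \mathcal{R}\}.$$
If $\mathcal{R}_{|Y}$ contains all subsets of Y, then Y is shattered by $\mathcal{R}$.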
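For a finite toy range space, shattering can be checked by brute force. The sketch below is only an illustration: the helper names `restrict`, `is_shattered`, and `vc_dimension` are hypothetical, and the example range space (integer intervals over six points) is chosen for this illustration and does not appear in the paper.

```python
from itertools import combinations


def restrict(ranges, Y):
    """Restriction of the range set to Y: all distinct intersections r & Y."""
    Y = frozenset(Y)
    return {frozenset(r) & Y for r in ranges}


def is_shattered(ranges, Y):
    """Y is shattered iff every subset of Y arises as an intersection."""
    return len(restrict(ranges, Y)) == 2 ** len(set(Y))


def vc_dimension(ranges, X):
    """Size of the largest shattered subset of a finite ground set X."""
    X = list(X)
    best = 0
    for size in range(1, len(X) + 1):
        if any(is_shattered(ranges, Y) for Y in combinations(X, size)):
            best = size
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return best


# Toy example: ground set {0,...,5}, ranges are all integer intervals [a, b).
X = range(6)
intervals = [set(range(a, b)) for a in range(7) for b in range(a, 7)]
```

Intervals shatter any two points but no three, since the outer two of three points cannot be selected without the middle one.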

Definition 2.8 (shattering dimension)
The shattering dimension of $(X, \mathcal{R})$ is the smallest δ such that, for all m,
$$\max_{B \subseteq X,\ |B| = m} |\mathcal{R}_{|B}| = O(m^{\delta}).$$
It is well known that a range space $(X, \mathcal{R})$ with VC dimension ν and shattering dimension δ satisfies $\nu = O(\delta \log \delta)$ and $\delta = O(\nu)$. So bounding the shattering dimension and bounding the VC dimension are asymptotically equivalent up to a logarithmic factor. For a proof of this and other basic facts on range spaces we refer the reader to the textbook of Har-Peled [30].

Definition 2.9 (dual range space)
Given a range space $(X, \mathcal{R})$, for any $p \in X$ we define
$$\mathcal{R}_p = \{R \in \mathcal{R} \mid p \in R\}.$$
The dual range space of $(X, \mathcal{R})$ is the range space $(\mathcal{R}, \{\mathcal{R}_p \mid p \in X\})$.

It is a well-known fact that if a range space has VC dimension ν, then the dual range space has VC dimension $\le 2^{\nu+1}$ (see e.g. [30]). There are many techniques for bounding the VC dimension of geometric range spaces. For instance, when the ground set is $\mathbb{R}^d$ and the ranges are defined by inclusion in halfspaces, then the range space and its dual range space are isomorphic and both have VC dimension and shattering dimension d. When the ranges are defined by inclusion in balls, then the VC dimension and shattering dimension are d + 1, and the dual range spaces have bounds of d [30]. It is also known [12] that the range space formed by k-fold unions or intersections of ranges from a range space with bounded VC dimension ν has VC dimension O(νk log k), and it was recently shown by Csikós et al. that this is tight even for some simple range spaces such as those defined by halfspaces [18,19]. More such results are deferred to Sect. 6.

Range Spaces Induced by Distance Measures
Let (M, d) be a pseudometric space. We define the ball of radius r and center p, under the distance measure d, as the following set:
$$b(p, r) = \{x \in M \mid d(x, p) \le r\},$$
where $p \in M$. The doubling dimension of a metric space (M, d), denoted as ddim(M, d), is the smallest integer t such that any ball can be covered by at most $2^t$ balls of half the radius.
In this paper, we study the VC dimension of variants of range spaces $(X, \mathcal{R})$ induced by pseudometric spaces (M, d) by setting X = M and
$$\mathcal{R} = \{b(p, r) \mid r \in \mathbb{R},\ r > 0,\ p \in M\}.$$
It is a reasonable question to ask whether the doubling dimension of a metric space influences the VC dimension of the induced range space. In general, a bounded doubling dimension does not imply a bounded VC dimension of the induced range space and vice versa. Recently, Huang et al. [34] showed that if we allow a small (1 + ε)-distortion of the distance function d, the shattering dimension can be upper bounded by $O(\varepsilon^{-O(\mathrm{ddim}(M,d))})$. It is conceivable that the doubling dimension of the metric space of the discrete Fréchet distance and Hausdorff distance is bounded, as long as the underlying metric has bounded doubling dimension. However, for the continuous Fréchet distance, the doubling dimension is known to be unbounded [20]. Moreover, we will see that much better bounds can be obtained by a careful study of the specific distance measure.
Specifically, we study an unbalanced version of the above range space, in the sense that we distinguish between the complexity of objects of the ground set and the complexity of objects defining the ranges. In our case, the ground set consists of polygonal curves of complexity m, and the ranges are defined by polygonal curves of complexity k. To this end, we define, for any integers d and m, $X^d_m := (\mathbb{R}^d)^m$ and we treat the elements of this set as ordered sets of points in $\mathbb{R}^d$ of size m. Formally, we study range spaces with ground set $X^d_m$ and a range set of the form
$$\mathcal{R}_{\rho,k} = \{b_\rho(p, r) \cap X^d_m \mid r > 0,\ p \in X^d_k\}$$
under different variants ρ of the Fréchet and Hausdorff distances, where $b_\rho(p, r)$ denotes the ball of radius r centered at p under ρ. We emphasize that the range space consists of ranges of all radii.
While the VC dimension bounds for the discrete Hausdorff and Fréchet metric balls may seem like an easy implication of composition theorems for the VC dimension [12,18], we still find three things about these results remarkable:

Our Results
1. First consider the valid alignment paths in the free-space diagram: these are all sequences of cells that are monotone in both coordinates, whose first cell contains (0, 0) and whose last cell contains (1, 1). For Fréchet variants, there are $\Theta(2^k 2^m)$ valid alignment paths in the free-space diagram, and one may expect this number to materialize in the bound obtained from the composition theorem. Yet by a simple analysis of the shattering dimension, we show that it does not.
2. Second, the VC dimension only has logarithmic dependence on the size m of the curves in the ground set, rather than the polynomial dependence one would obtain by a simple application of composition theorems. This difference has important implications in analyzing real data sets where we can query with simple curves (small k), but may not have a small bound on the size of the curves in the data set (large m).
3. Third, for the continuous variants, the range spaces can indeed be decomposed into problems with ground sets defined on line segments. However, we do not know of a general d-dimensional bound on the VC dimension of a range space with a ground set of segments, and ranges defined by segments within a radius r of another segment. We are able to circumvent this challenge with a technique to bound the VC dimension using a simple model of computation, and careful predicate design.

Our Approach
Our methods use the fact that both the Fréchet distance and the Hausdorff distance are determined by one of a discrete set of events, where each event involves a constant number of simple geometric objects. For example, it is well known that the Hausdorff distance between two discrete sets of points is equal to the distance between two points from the two sets. The corresponding event happens as we consider a value δ > 0 increasing from 0 and we record which points of one set are contained in which balls of radius δ centered at points from the other set. The same phenomenon is true for the discrete Fréchet distance between two point sequences. In particular, the so-called free-space matrix (the discrete version of the free-space diagram) which can be used to decide whether the discrete Fréchet distance is smaller than a given value δ encodes exactly the information about which pairs of points have distance at most δ. The basic phenomenon remains true for the continuous versions of the two distance measures if we extend the set of simple geometric objects to include line segments and if we also consider triple intersections. Each type of event can be translated into a range space of which we can analyze the VC dimension. Together, the product of the range spaces encodes the information, which curves lie inside which metric balls, in the form of a set system. This representation allows us to prove bounds on the VC dimension of metric balls under these distance measures.
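The statement that the Hausdorff distance between two finite point sets is attained by a pair of points can be illustrated directly. The sketch below is an assumption-laden illustration, not the paper's construction: the function names are hypothetical, points are coordinate tuples, and both functions return a witness pair together with the distance.

```python
from math import dist


def directed_hausdorff(A, B):
    """Return (distance, a, b): the directed Hausdorff distance from A to B
    together with a pair of points (a in A, b in B) that realizes it."""
    # For every a, find its nearest point in B ...
    nearest = [min(((dist(a, b), a, b) for b in B), key=lambda t: t[0]) for a in A]
    # ... then take the point of A whose nearest neighbour is farthest away.
    return max(nearest, key=lambda t: t[0])


def hausdorff(A, B):
    """Symmetric Hausdorff distance with a realizing pair of points."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A), key=lambda t: t[0])
```

Because the result is always the distance of one concrete pair, only the pairwise distances can appear as critical values when the radius grows from 0.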

Basic Idea: Discrete Fréchet and Hausdorff
In this section we prove our upper bounds in the discrete setting. Let $X^d_m = (\mathbb{R}^d)^m$; we treat the elements of this set as ordered sets of points in $\mathbb{R}^d$ of size m. The range spaces that we consider in this section are defined over the ground set $X^d_m$ and the range set of balls under either the Hausdorff or the discrete Fréchet distance. The proofs in the subsequent sections all follow the basic idea of the proof in the discrete setting.
This implies that and hence, 2 We can similarly bound the shattering dimension δ, ) as in the proof of Theorem 5.1. Enforcing that a sequence contains a valid alignment path only reduces the number of possible distinct sets formed by t curves, and it can be determined using these intersections and the two orderings of B 1 , . . . , B k and of vertices within some S j ∈ X d m .

Preliminaries
In this section, we provide a more advanced set of geometric primitives and other known technical results about the VC dimension. We also derive some simple corollaries. Additionally, we provide some basic results about the distances which will couple with the geometric primitives in our proofs for continuous distance measures.
We again consider a ground set $X^d_m = (\mathbb{R}^d)^m$ which we treat as a set of polygonal curves with points in $\mathbb{R}^d$ of size m. Given such a curve $s \in X^d_m$, let V(s) be its ordered set of vertices and E(s) its ordered set of edges.

A Simple Model of Computation
We consider a model of computation that will be useful for modeling primitive geometric sets, and in turn bounding the VC dimension of an associated range space. These will be useful in that they allow the invocation of powerful and general tools to describe range spaces defined by distances between curves. We allow the following operations, which we call simple operations:
- the arithmetic operations +, −, ×, and / on real numbers,
- jumps conditioned on >, ≥, <, ≤, =, and ≠ comparisons of real numbers, and
- output 0 or 1.
We say a function requires t simple operations if it can be computed with a circuit of depth t composed only of these simple operations. Note that with the above simple operations, we can also perform logical operations. Furthermore, the lack of a square-root operator creates some challenges when dealing with non-linear geometric objects. Therefore, we prove the following technical lemma showing that we can compare certain expressions involving square roots without computing them explicitly, i.e., only simple operations are needed for the comparison.
Proof It suffices to prove the case of $\alpha + \sqrt{\beta} \le \gamma + \sqrt{\delta}$, as $\alpha + \sqrt{\beta} \ge \gamma + \sqrt{\delta}$ is analogous. We simply show that this comparison is equivalent to a comparison involving only a constant number of simple operations starting from the values α, β, γ, δ. If $\alpha \le \gamma$, then
$$\alpha + \sqrt{\beta} \le \gamma + \sqrt{\delta} \iff \sqrt{\beta} \le (\gamma - \alpha) + \sqrt{\delta} \iff \beta \le (\gamma - \alpha)^2 + 2(\gamma - \alpha)\sqrt{\delta} + \delta \iff \beta - (\gamma - \alpha)^2 - \delta \le 2(\gamma - \alpha)\sqrt{\delta}.$$
The second equivalence holds because both sides are at least 0. Now, note that the right side of the last inequality is at least 0 and thus, if the left side is negative (which we can check using O(1) simple operations), we are done. Thus, assume the left side is at least 0. Then we can square both sides and obtain a comparison involving only simple operations. Now, if γ < α, we can do an analogous calculation, where we subtract γ instead of α in the first equivalence. As testing γ < α is a simple operation, we can determine which case we are in.
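The case analysis in this proof translates into a short routine that uses only arithmetic and comparisons. The following is a sketch under the assumption that β and δ are nonnegative; the function name `leq_sqrt_free` is introduced here for illustration only.

```python
def leq_sqrt_free(alpha, beta, gamma, delta):
    """Decide alpha + sqrt(beta) <= gamma + sqrt(delta) without square roots.

    Assumes beta >= 0 and delta >= 0. Uses only +, -, * and comparisons.
    """
    if alpha <= gamma:
        c = gamma - alpha                # c >= 0
        lhs = beta - c * c - delta       # test: lhs <= 2 * c * sqrt(delta)
        if lhs < 0:
            return True                  # the right side is nonnegative
        return lhs * lhs <= 4 * c * c * delta
    c = alpha - gamma                    # c > 0, subtract gamma instead of alpha
    rhs = delta - c * c - beta           # test: 2 * c * sqrt(beta) <= rhs
    if rhs < 0:
        return False                     # the left side is nonnegative
    return 4 * c * c * beta <= rhs * rhs
```

With integer or rational inputs the routine is exact, which is the point of avoiding the square root.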

Geometric Primitives
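For any $p \in \mathbb{R}^d$ we denote by $B_r(p)$ the ball of radius r, centered at p. For any two points $s, t \in \mathbb{R}^d$, we denote by st the line segment from s to t. Whenever we store such a line segment, for technicalities within the lemma below, we store the coordinates of its endpoints s and t. For any two points $s, t \in \mathbb{R}^d$, we define the stadium centered at st: $D_r(st) = \{x \in \mathbb{R}^d \mid \exists p \in st,\ \|p - x\| \le r\}$. We also define the cylinder centered at st: $C_r(st) = \{x \in \mathbb{R}^d \mid \exists p \in \ell(st),\ \|p - x\| \le r\}$, where $\ell(st)$ denotes the line supporting the edge st. Finally, for any two points $s, t \in \mathbb{R}^d$, we define the capped cylinder centered at st: $R_r(st) = \{p + u \mid p \in st \text{ and } u \in \mathbb{R}^d \text{ s.t. } \|u\| \le r \text{ and } \langle t - s, u \rangle = 0\}$ (Fig. 1).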
For each of these geometric sets, we can determine if a point x ∈ R d is in the set with a constant number of operations under a simple model of computation.

Lemma 6.2 For a point $x \in \mathbb{R}^d$, and any set of the form $B_r(p)$, $D_r(st)$, $C_r(st)$, or $R_r(st)$, we can determine if x is in that set (returns 1, otherwise 0) using O(d) simple operations.
Proof For the ball $B_r(p)$ we can compute the squared distance $\|x - p\|^2$ in O(d) time, and determine inclusion with a comparison to $r^2$. For the cylinder $C_r(st)$ we can compute the closest point to x on the line $\ell(st)$ as
$$\pi_{st}(x) = s + \frac{\langle x - s, t - s \rangle}{\|t - s\|^2}\,(t - s).$$
Then we can determine inclusion by comparing $\|\pi_{st}(x) - x\|^2$ to $r^2$. For the capped cylinder $R_r(st)$ we also need to compare $\|\pi_{st}(x) - t\|^2$ and $\|\pi_{st}(x) - s\|^2$ to see if either of these terms is greater than
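The membership tests of this lemma can be sketched with arithmetic and comparisons only. The code below is an illustrative variant, not the proof's exact procedure: all function names are hypothetical, points are tuples of numbers, s and t are assumed distinct, and the cap test checks that the projection parameter lies in [0, 1] instead of comparing squared distances to the endpoints.

```python
def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def in_ball(x, p, r):
    d = sub(x, p)
    return dot(d, d) <= r * r


def line_param(x, s, t):
    """Parameter lam such that s + lam * (t - s) is the point of the
    supporting line closest to x. Assumes s != t."""
    e = sub(t, s)
    return dot(sub(x, s), e) / dot(e, e)


def in_cylinder(x, s, t, r):
    e = sub(t, s)
    lam = line_param(x, s, t)
    foot = [si + lam * ei for si, ei in zip(s, e)]  # projection of x onto the line
    d = sub(x, foot)
    return dot(d, d) <= r * r


def in_capped_cylinder(x, s, t, r):
    # inside the infinite cylinder and projecting onto the segment itself
    return in_cylinder(x, s, t, r) and 0 <= line_param(x, s, t) <= 1


def in_stadium(x, s, t, r):
    # the stadium is the union of the capped cylinder and the two end balls
    return in_capped_cylinder(x, s, t, r) or in_ball(x, s, r) or in_ball(x, t, r)
```

Each test performs a constant number of inner products, so the number of operations is linear in d.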

Bounding the VC Dimension
For range spaces defined on continuous curves, our proofs use a powerful theorem from Goldberg and Jerrum [27] as improved and restated by Anthony and Bartlett [7]. It allows one to easily bound the VC dimension of geometric range spaces under our simple model of computation. Note that these bounds are not always tight. Specifically, the VC dimension for ranges defined geometrically by balls $B_r(p)$ is O(d) [30]. Moreover, the VC dimension of range spaces defined by cylinders $C_r(st)$ is known to be O(d) [4]. The ranges defined by capped cylinders $R_r(st)$ are the intersection of a cylinder and two halfspaces, each with VC dimension O(d) and hence, by the composition theorem [12], this full range space also has VC dimension O(d). Finally, the stadium $D_r(st)$ is defined by the union of a capped cylinder $R_r(st)$ and two balls $B_r(s)$ and $B_r(t)$; hence, again by the composition theorem [12], its VC dimension is O(d).
However, it is not clear that these improved bounds hold for the dual range spaces, aside from the case of $B_r$. Moreover, when the ground set X of the range space $(X, \mathcal{R})$ is not $\mathbb{R}^d$, then we need to be cautious in using the k-fold composition theorem [12], which bounds the VC dimension of complex range spaces derived as the logical intersection or union of simpler range spaces with bounded VC dimension. In the case of a ground set $X = \mathbb{R}^d$, logical and geometric intersections are the same, but for other ground sets (like dual objects, or line segments $X^d_2$) this is not necessarily the case. For instance, a line segment $e \in X^d_2$ may intersect a ball $B_r$ and also a halfspace H while not intersecting the intersection $B_r \cap H$.

Representation by Predicates
In order to prove bounds on the VC dimension of range spaces defined on continuous curves, we establish sets of geometric predicates which are sufficient to determine if two curves have distance at most r to each other. Analyzing the range spaces associated with these predicates (over all possible radii r ) allows us to compose them further and to establish VC dimension bounds for the range space induced by the corresponding distance measure. For the Fréchet and weak Fréchet distance, the predicates mirror those used in range searching data structures [1,2]. And for the Hausdorff distance on continuous curves, the predicates are derived from the Voronoi diagram [5]. The technical challenges for each case are similar, but require different analyses.

The Hausdorff Distance
We consider the range space (X d m , R r H k ), where R r H k denotes the set of all balls, of radius r , centered at curves in X d k , under the Hausdorff distance. 3 We also consider the same problems under both directed versions of the Hausdorff distance, and their induced range spaces

Hausdorff Distance Predicates
Consider two sets of line segments A and B such that any two segments that belong to the same set have disjoint interiors. Consider the Voronoi diagram of the vertices and open segments of B: each element of B (i.e., open segment or vertex) is assigned to a Voronoi cell which is the set of points that are closer to this element than to any other element (see Fig. 2). According to Alt et al. [5], the critical points for the
Consider any two polygonal curves $s \in X^d_m$ and $q \in X^d_k$. In order to encode the intersection of polygonal curves with metric balls under the Hausdorff metric, we will first define a subset of $\mathbb{R}^d$, a double-stadium, defined by two line segments $\{e_1, e_2\}$ and a radius r as $D_{r,2}(e_1, e_2) = D_r(e_1) \cap D_r(e_2)$.
We will make use of the following predicates:
$P_1$ (Vertex-edge (horizontal)) Given an edge of s, $s_j s_{j+1}$, and a vertex $q_i$ of q, this predicate returns true iff there exists a point $p \in s_j s_{j+1}$ such that $\|p - q_i\| \le r$.
$P_2$ (Vertex-edge (vertical)) Given an edge of q, $q_i q_{i+1}$, and a vertex $s_j$ of s, this predicate returns true iff there exists a point $p \in q_i q_{i+1}$ such that $\|p - s_j\| \le r$.
$P_3$ (d-stadium-line (horizontal)) Given an edge of q, $q_i q_{i+1}$, and two edges of s, $\{e_1, e_2\} \subset E(s)$, this predicate is equal to $\ell(q_i q_{i+1}) \cap D_{r,2}(e_1, e_2) \ne \emptyset$, where $\ell(\cdot)$ denotes the line supporting an edge.
$P_4$ (d-stadium-line (vertical)) Given one edge of s, $s_j s_{j+1}$, and two edges of q, $\{e_1, e_2\} \subset E(q)$, this predicate is equal to $\ell(s_j s_{j+1}) \cap D_{r,2}(e_1, e_2) \ne \emptyset$.
Proof We first assume for the sake of simplicity that q is a line segment with endpoints $q_1$ and $q_2$. We claim that $d_{\vec{H}}(q, s) \le r$ if and only if there exists a sequence of edges $s_{j_1}s_{j_1+1}, s_{j_2}s_{j_2+1}, \dots, s_{j_v}s_{j_v+1}$ for some integer value v, such that the predicates $P_1(q_1, s_{j_1}s_{j_1+1})$ and $P_1(q_2, s_{j_v}s_{j_v+1})$ both evaluate to true and the conjunction $\bigwedge_{i=1}^{v-1} P_3(q_1q_2, s_{j_i}s_{j_i+1}, s_{j_{i+1}}s_{j_{i+1}+1})$ evaluates to true. Assume such a sequence of edges exists. In this case, there exists a sequence of points $p_1, \dots, p_v$ on the line supporting q, with $p_1 = q_1$, $p_v = q_2$, and such that for $1 \le i < v$, $p_i, p_{i+1} \in D_r(s_{j_i}s_{j_i+1})$. That is, two consecutive points of the sequence are contained in the same stadium. Indeed, for i = 1 we have $p_1 = q_1$ and $q_1, p_2 \in D_r(s_{j_1}s_{j_1+1})$ since the corresponding $P_1$ and $P_3$ predicates evaluate to true. Likewise, for i = v − 1, it is implied by the corresponding predicates $P_1(q_2, s_{j_v}s_{j_v+1})$ and $P_3(q_1q_2, s_{j_{v-1}}s_{j_{v-1}+1}, s_{j_v}s_{j_v+1})$. For the remaining 1 < i < v − 1, it follows from the conditions given by the specified $P_3$ predicates.
Now, since each stadium is a convex set, it follows that each line segment connecting two consecutive points of this sequence $p_i, p_{i+1}$ is contained in one of the stadiums. Note that the set of line segments obtained this way forms a connected polygonal curve which fully covers the line segment q. It follows that $d_{\vec{H}}(q, s) \le r$. Let w be the number of intersection points and let v = w + 2. We claim that this implies that there exists a sequence of edges $s_{j_1}s_{j_1+1}, s_{j_2}s_{j_2+1}, \dots, s_{j_v}s_{j_v+1}$ with the properties stated above. Let $p_1 = q_1$, $p_v = q_2$, and let $p_i$ for 1 < i < v be the intersection points ordered in the direction of the line segment q. By construction, it must be that each $p_i$ for 1 < i < v is contained in the intersection of two stadiums, since it is the intersection with the boundary of a stadium and the entire edge is covered by the union of stadiums. Moreover, two consecutive points $p_i, p_{i+1}$ are contained in exactly the same subset of stadiums; otherwise there would be another intersection point with the boundary of a stadium in between $p_i$ and $p_{i+1}$. This implies a set of true predicates of type $P_3$ with the properties defined above. The predicates of type $P_1$ follow trivially from the definition of the directed Hausdorff distance. This concludes the proof of the other direction.
In general, for any polygonal curve $q \in X^d_k$ with vertices $q_1, \dots, q_k$, we have that
$$d_{\vec{H}}(q, s) = \max_{1 \le i < k} d_{\vec{H}}(q_i q_{i+1}, s).$$
Thus, we can apply the arguments above to each edge of q individually. Similarly, we can prove that given the truth values of the predicates $P_2$, $P_4$ one can determine whether $d_{\vec{H}}(s, q) \le r$, by an argument symmetric to the above.

Hausdorff Distance VC Dimension Bound
where α, β, γ , δ ∈ R can be computed using O(d) simple operations, or it is empty.
Proof We first compute the intersection of the infinite cylinder $C_r(uv)$ with the line $\ell(st)$. Let $f(x) = u + (v - u)x$ be the line $\ell(uv)$ parametrized by $x \in \mathbb{R}$ and $g(y) = s + (t - s)y$ the line $\ell(st)$ parametrized by $y \in \mathbb{R}$. We describe all values x, y parameterizing points in this intersection by quantifying the boundaries of this set. All points in the intersection of $\ell(st)$ with the boundary of the infinite cylinder $C_r(uv)$ are described by For any fixed y, this is a quadratic equation in x and the discriminant is Note that the quadratic equation has one solution exactly for those points on $\ell(st)$ which have distance r from $\ell(uv)$, because the ball around those points intersects $\ell(uv)$ exactly once. Those are also the points which define the boundary of $\ell(st) \cap R_r(uv)$. Thus, we want to solve h(y) = 0. As $z_i(y)$ is linear in y, we obtain a quadratic equation in y. Note that all coefficients of the quadratic equation can be computed in O(d) simple operations. Both solutions of this equation are of the form $\alpha \pm \sqrt{\beta}$. If β < 0, then the intersection is empty. Otherwise, we obtain an intersection interval $[\alpha - \sqrt{\beta}, \alpha + \sqrt{\beta}]$ for the infinite cylinder.
To obtain the intersection with the capped cylinder, we first compute the intersection of $\ell(st)$ with the top and bottom hyperplanes of the cylinder. The two hyperplanes are given by all $p \in \mathbb{R}^d$ which satisfy $\langle p - u, v - u \rangle = 0$ and $\langle p - v, v - u \rangle = 0$, respectively. By plugging the line equation into the hyperplane formulas, we get the intersection points. For the first hyperplane we thereby obtain
$$y = \frac{\langle u - s, v - u \rangle}{\langle t - s, v - u \rangle}.$$
The intersection with the second hyperplane is analogous. Thus, we again obtain an interval for y such that the values in this interval parametrize the points of $\ell(st)$ that lie between the two hyperplanes. Again O(d) simple operations are sufficient to compute the boundaries of this interval.
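The discriminant step can be written out as a worked derivation. The following is a sketch under assumed notation: the direction vector $w$ and the offset $c(y)$ are names introduced here for illustration and need not coincide with the coordinates $z_i(y)$ used in the proof; $u$, $v$, $s$, $t$, $r$ and $g(y) = s + (t - s)y$ are as in the text.

```latex
\begin{align*}
w &= v - u, \qquad c(y) = g(y) - u = (s - u) + (t - s)\,y, \\
\|u + w x - g(y)\|^2 = r^2
  &\iff \|w\|^2 x^2 - 2\langle w, c(y)\rangle\, x + \|c(y)\|^2 - r^2 = 0, \\
h(y) &= 4\langle w, c(y)\rangle^2 - 4\|w\|^2\bigl(\|c(y)\|^2 - r^2\bigr).
\end{align*}
```

Since $c(y)$ is linear in y, $h$ is a polynomial of degree at most 2 in y. Moreover, $h(y) = 0$ holds exactly when $g(y)$ has distance r from the line through u and v, because the squared distance of $g(y)$ to that line equals $\|c(y)\|^2 - \langle w, c(y)\rangle^2 / \|w\|^2$.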
To obtain the intersection with the capped cylinder (not just with its boundary planes), we intersect the two intervals we obtained for the intersection with the infinite cylinder as well as the boundary planes of the capped cylinder. As computing the intersection of intervals is simply taking the minimum/maximum, we can use Lemma 6.1 to do this in O(1) simple operations. The values for α, β, γ , δ are then given by the intersection interval boundaries which are chosen from the boundaries of the intersection interval of the planes of the capped cylinder and the infinite cylinder.
Additionally, the following lemma holds, which states that we can express an intersection of a ball and a line with an interval of the form as in the previous lemma.

Lemma 7.3 Given a line $\ell(st)$ with $st \in X^d_2$ and a ball $B_r(c)$ centered at c, the intersection $\ell(st) \cap B_r(c)$ of those two objects is either
where α, β, γ , δ ∈ R can be computed using O(d) simple operations, or it is empty.

Proof The intersection is given by the x fulfilling $\|s + (t - s)x - c\|^2 \le r^2$. Determining the extremal values of x that satisfy this inequality amounts to solving a quadratic equation in x. Solving it, we obtain an intersection interval as required.
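The quadratic in this proof can be solved in closed form so that the interval endpoints have the shape $\alpha \pm \sqrt{\beta}$ without a square root ever being evaluated. The sketch below assumes s and t are distinct; the function name `line_ball_interval` and the tuple-based point format are illustrative assumptions.

```python
def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def line_ball_interval(s, t, c, r):
    """Return (alpha, beta) such that the parameters x with
    |s + (t - s) x - c| <= r form [alpha - sqrt(beta), alpha + sqrt(beta)],
    or None if the line misses the ball. Assumes s != t."""
    e = sub(t, s)              # direction of the line
    w = sub(s, c)              # offset of the base point from the center
    a = dot(e, e)
    b = dot(e, w)
    k = dot(w, w) - r * r
    # the inequality reads a*x^2 + 2*b*x + k <= 0
    alpha = -b / a
    beta = (b * b - a * k) / (a * a)
    if beta < 0:
        return None
    return alpha, beta
```

Returning the pair (alpha, beta) instead of the two endpoints keeps the routine within the simple operations of the model; comparisons between such endpoints can then be done without square roots.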
Having proven those technical lemmas, we are now ready to start our argument for bounding the VC dimension. We argue that the truth values for predicate $P_1$ over all possible inputs are uniquely defined by the set Similarly, the truth values for predicate $P_2$ are uniquely defined by the set Then the predicates $P_3$ and $P_4$ induce sets (where effectively $P_4(q, s) = P_3(s, q)$) We require a technical proof, bounding the VC dimension of the range space defined on segments with ranges defined by double-stadiums. To this end, let be the families of subsets of line segments $st \in X^d_2$ whose supporting lines $\ell(st)$ intersect a common double-stadium $D_{r,2}(e_1, e_2)$. We are now ready to state and prove the following lemma.
Fig. 4 Illustration in $\mathbb{R}^2$ of predicates used in the proof of Lemma 7.4 for the example given in Fig. 3

Lemma 7.4 The VC dimension of the range space $(X^d_2, \mathcal{D}^d_2)$ and of the associated dual range space is $O(d^2)$.
Proof The predicate which determines whether a line $\ell$ intersects a double-stadium $D_{r,2}(e_1, e_2)$ can be implemented by taking the logical-or over O(1) calls to the following predicates (see Fig. 4 for an illustration):
$P_{BB}$: checks whether $\ell$ intersects $D_{r,2}(e_1, e_2)$ in the intersection of two radius r balls,
$P_{RR}$: checks whether $\ell$ intersects $D_{r,2}(e_1, e_2)$ in the intersection of two radius r capped cylinders,
$P_{RB}$: checks whether $\ell$ intersects $D_{r,2}(e_1, e_2)$ in the intersection of one ball and one capped cylinder, both of radius r.
For all predicates we first compute the intersection interval of the capped cylinder or ball using Lemmas 7.2 or 7.3. Applying Lemma 6.1, we can then compute the intersection of these two intersection intervals by comparing their bounds, obtaining an interval of the form  O(d 2 ). Since an element of the dual range space is also defined by O(d) real values, and the same operations can be applied, the dual range space also has VC dimension O(d 2 ).
Using the above lemmas, we now get the following theorems.
The shattering dimension of (X d m , Proof Let S ⊂ X d m be a set of t polygonal curves and let q ∈ X d k . By Lemma 7.1, the set s ∈ S | d− → H (q, s) ≤ r is uniquely defined by the sets s∈S P r 1 (q, s), s∈S P r 3 (q, s).
The number of all possible sets r ≥0 s∈S P r . This follows by the upper bound of Corollary 6.4, on the VC dimension of the range space having as ground set the set of stadiums and ranges corresponding to stabbing points, and the fact that we need to consider k vertices for the query curve. Furthermore, by Lemma 7.4, we are able to bound the number of all possible sets r ≥0 s∈S P r 3 (q, s) as (tm) O(d 2 k) . The k term in the exponent arises because we consider all k edges of q for predicate P 3 . Hence, We can similarly bound the shattering dimension δ, The shattering dimension of (X d m , Proof Let S ⊂ X d m be a set of t polygonal curves and let q ∈ X d k . By Lemma 7.1, the set s ∈ S | d− → H (q, s) ≤ r is uniquely defined by the sets s∈S P r 2 (q, s), s∈S P r 4 (q, s).
The number of all possible sets r ≥0 s∈S P r 2 (q, s) is bounded by (tm) O(d 2 k) . This follows by the upper bound of Corollary 6.4, on the VC dimension of range spaces with points as the ground set and stadiums as ranges, and the fact that we need to consider one stadium for each of the k − 1 query edges. Furthermore, by Lemma 7.4, we are able to bound the number of all possible sets r ≥0 s∈S P r 4 (q, s) as (tm) O(d 2 k 2 ) . The k 2 term in the exponent arises because we consider Θ(k 2 ) pairs of edges of q for predicate P 4 . Now, We can similarly bound the shattering dimension δ, Now bounding the number of all possible such sets, as we did in the proofs of Theorems 7.5 and 7.6, implies the statement.

The Fréchet Distance
We consider the range spaces $(X^d_m, \mathcal{R}_{F,k})$ and $(X^d_m, \mathcal{R}_{wF,k})$, where $\mathcal{R}_{F,k}$ (resp. $\mathcal{R}_{wF,k}$) denotes the set of all balls, centered at curves in $X^d_k$, under the Fréchet (resp. weak Fréchet) distance.

Fréchet Distance Predicates
It is known that the Fréchet distance between two polygonal curves can be attained either at a distance between their endpoints, at a distance between a vertex and a line supporting an edge, or at the common distance of two vertices with a line supporting an edge. The third type of event is sometimes called a monotonicity event, since it happens when the weak Fréchet distance is smaller than the Fréchet distance. In this sense, our representation of the ball of radius r under the Fréchet distance is based on the following predicates, some of which we already used in the last section. Let $s \in X^d_m$ with vertices $s_1, \dots, s_m$ and $q \in X^d_k$ with vertices $q_1, \dots, q_k$.
$P_1$ (Vertex-edge (horizontal)) As defined in Sect. 7.
$P_2$ (Vertex-edge (vertical)) As defined in Sect. 7.
$P_5$ (Endpoints (start)) This predicate returns true if and only if $\|s_1 - q_1\| \le r$.
$P_6$ (Endpoints (end)) This predicate returns true if and only if $\|s_m - q_k\| \le r$.
$P_7$ (Monotonicity (horizontal)) Given two vertices of s, $s_j$ and $s_t$ with j < t, and an edge of q, $q_i q_{i+1}$, this predicate returns true if there exist two points $p_1$ and $p_2$ on the line supporting the directed edge, such that $p_1$ appears before $p_2$ on this line, and such that $\|p_1 - s_j\| \le r$ and $\|p_2 - s_t\| \le r$.
$P_8$ (Monotonicity (vertical)) Given two vertices of q, $q_i$ and $q_t$ with i < t, and a directed edge of s, $s_j s_{j+1}$, this predicate returns true if there exist two points $p_1$ and $p_2$ on the line supporting the directed edge, such that $p_1$ appears before $p_2$ on this line, and such that $\|p_1 - q_i\| \le r$ and $\|p_2 - q_t\| \le r$.
Predicate P 8 is illustrated in Fig. 5. Predicate P 7 is symmetric.
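To make the monotonicity predicates concrete, the following minimal Python sketch evaluates the condition shared by $P_7$ and $P_8$: each ball of radius $r$ cuts an interval out of the supporting line, and suitable points $p_1, p_2$ exist exactly when the first interval starts no later than the second one ends. The function names and the parametrization of the line by a parameter $t$ are assumptions made for this illustration, and the sketch accepts $p_1 = p_2$, which matches the left example in Fig. 5.

```python
import math


def line_interval(center, a, b, r):
    """Parameter interval [lo, hi] of points a + t*(b - a) on the line through
    a and b that lie within distance r of `center`; None if there are none."""
    direction = [bi - ai for ai, bi in zip(a, b)]
    length_sq = sum(c * c for c in direction)
    if length_sq == 0:
        raise ValueError("the edge must have two distinct endpoints")
    # Parameter of the orthogonal projection of `center` onto the line.
    t0 = sum((ci - ai) * di for ci, ai, di in zip(center, a, direction)) / length_sq
    foot = [ai + t0 * di for ai, di in zip(a, direction)]
    gap_sq = r * r - sum((ci - fi) ** 2 for ci, fi in zip(center, foot))
    if gap_sq < 0:
        return None  # the ball does not reach the line
    half_width = math.sqrt(gap_sq / length_sq)
    return (t0 - half_width, t0 + half_width)


def monotonicity_predicate(first, second, a, b, r):
    """True iff there are p1, p2 on the line through the directed edge (a, b),
    with p1 not after p2, such that |p1 - first| <= r and |p2 - second| <= r."""
    first_interval = line_interval(first, a, b, r)
    second_interval = line_interval(second, a, b, r)
    if first_interval is None or second_interval is None:
        return False
    # Choose p1 as early and p2 as late as possible.
    return first_interval[0] <= second_interval[1]
```

For $P_7$ one would call this with `first = s_j`, `second = s_t` and the edge `(q_i, q_{i+1})`; for $P_8$ the roles of the two curves are exchanged.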

Lemma 8.1 ([1, Lem. 9])
Given the truth values of all predicates $P_1, P_2, P_5, P_6, P_7, P_8$ of two curves $s$ and $q$ for a fixed value of $r$, one can determine if $d_F(s, q) \le r$.
Predicates $P_1, P_2, P_5, P_6$ are sufficient for representing metric balls under the weak Fréchet distance. We include a proof for the sake of completeness.

Lemma 8.2
Given the truth values of all predicates $P_1, P_2, P_5, P_6$ of two curves $s$ and $q$ for a fixed value of $r$, one can determine if $d_{wF}(s, q) \le r$.
Proof Alt and Godau [6] describe an algorithm for computing the weak Fréchet distance which can be used here. In particular, one can construct an edge-weighted grid graph on the cells (edge-edge pairs) of the parametric space of the two polygonal curves, and subsequently compute a bottleneck-shortest path from the pair of first edges to the pair of last edges along the two curves. We can use edge weights in $\{0, 1\}$ to encode whether the corresponding vertex-edge pair has distance at most $r$, as given by the predicates $P_1$ and $P_2$. The weak Fréchet distance between $q$ and $s$ is at most $r$ if and only if there exists a bottleneck-shortest path of cost 0 and the endpoint conditions given by the predicates $P_5$ and $P_6$ are satisfied.
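The decision procedure in this proof can be sketched in Python as follows. This is an illustration under stated assumptions rather than the algorithm of [6] verbatim: since the weights are in $\{0, 1\}$, the bottleneck-shortest path of cost 0 is replaced by a breadth-first search that only crosses cell boundaries whose vertex-edge predicate is true. The function names and the layout of the two truth tables are hypothetical, and both curves are assumed to have at least two vertices.

```python
from collections import deque
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment from a to b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    length_sq = sum(c * c for c in ab)
    t = 0.0
    if length_sq > 0:
        t = max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / length_sq))
    closest = [ai + t * c for ai, c in zip(a, ab)]
    return math.dist(p, closest)


def weak_frechet_from_predicates(start_ok, end_ok, s_near_q_edge, q_near_s_edge):
    """Decide d_wF(s, q) <= r from predicate truth values only.

    start_ok, end_ok: endpoint predicates.
    s_near_q_edge[j][i]: vertex s[j] is within r of edge (q[i], q[i+1]).
    q_near_s_edge[i][j]: vertex q[i] is within r of edge (s[j], s[j+1]).
    """
    if not (start_ok and end_ok):
        return False
    n_s_edges = len(q_near_s_edge[0])
    n_q_edges = len(s_near_q_edge[0])
    goal = (n_s_edges - 1, n_q_edges - 1)
    seen = {(0, 0)}  # cell (j, i) pairs edge j of s with edge i of q
    queue = deque(seen)
    while queue:
        j, i = queue.popleft()
        if (j, i) == goal:
            return True
        candidates = []
        # Crossing to the next/previous edge of s passes a vertex of s.
        if j + 1 < n_s_edges and s_near_q_edge[j + 1][i]:
            candidates.append((j + 1, i))
        if j > 0 and s_near_q_edge[j][i]:
            candidates.append((j - 1, i))
        # Crossing to the next/previous edge of q passes a vertex of q.
        if i + 1 < n_q_edges and q_near_s_edge[i + 1][j]:
            candidates.append((j, i + 1))
        if i > 0 and q_near_s_edge[i][j]:
            candidates.append((j, i - 1))
        for cell in candidates:
            if cell not in seen:
                seen.add(cell)
                queue.append(cell)
    return False


def weak_frechet_at_most(s, q, r):
    """Evaluate the predicates for curves s and q, then run the search."""
    start_ok = math.dist(s[0], q[0]) <= r
    end_ok = math.dist(s[-1], q[-1]) <= r
    s_near = [[point_segment_distance(v, q[i], q[i + 1]) <= r
               for i in range(len(q) - 1)] for v in s]
    q_near = [[point_segment_distance(v, s[j], s[j + 1]) <= r
               for j in range(len(s) - 1)] for v in q]
    return weak_frechet_from_predicates(start_ok, end_ok, s_near, q_near)
```

Note that `weak_frechet_from_predicates` never looks at coordinates, which mirrors the statement of the lemma: the truth values alone determine the answer.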

Fréchet Distance VC Dimension Bounds
We first consider the range space $(\mathbb{X}^d_m, \mathcal{R}_{wF,k})$, where $\mathcal{R}_{wF,k}$ is the set of all balls under the weak Fréchet distance centered at curves in $\mathbb{X}^d_k$. The main task is to translate the predicates $P_1, P_2, P_5, P_6$ into simple range spaces, and then bound their associated VC dimensions. Consider any two polygonal curves $s \in \mathbb{X}^d_m$ and $q \in \mathbb{X}^d_k$. In order to encode the intersection of polygonal curves with metric balls, we will make use of the sets $P^r_1(q, s)$ and $P^r_2(q, s)$, which are defined in Sect. 7, and of the sets $P^r_5(q, s)$ and $P^r_6(q, s)$, which correspond to the endpoint predicates $P_5$ and $P_6$. The number of all possible sets $\bigcup_{r \ge 0} \bigcup_{s \in S} P^r_1(q, s)$ and the number of all possible sets $\bigcup_{r \ge 0} \bigcup_{s \in S} P^r_2(q, s)$ are both bounded by $(tm)^{O(d^2 k)}$, by Corollary 6.4 using the set $D_r(\overline{st})$ and by considering the dual range space, respectively. Notice that the number of all possible sets $\bigcup_{r \ge 0} \bigcup_{s \in S} P^r_5(q, s)$ is bounded by $(tm)^{O(d)}$. The same holds for the number of all possible sets $\bigcup_{r \ge 0} \bigcup_{s \in S} P^r_6(q, s)$.
Fig. 5 Illustration of predicate $P_8$ in $\mathbb{R}^2$ with line $\ell$ and the two disks centered at $q_1$ and $q_2$. In these examples, the projection of $q_2$ onto $\ell$ appears before the projection of $q_1$ onto $\ell$ along the direction of $\ell$, and the intersection of $\ell$ with the bisector lies outside of the lens formed by the two disks. On the left, the predicate is satisfied by setting $p_1 = p_2 = \pi_{\overline{st}}(q_1)$. On the right, the predicate evaluates to false.

Theorem 8.3 Let $\mathcal{R}_{wF,k}$ be the set of balls under the weak Fréchet metric centered at polygonal curves in
We can similarly bound the shattering dimension $\delta$.
We now consider the range space $(\mathbb{X}^d_m, \mathcal{R}_{F,k})$, where $\mathcal{R}_{F,k}$ denotes the set of all balls, centered at curves in $\mathbb{X}^d_k$, under the Fréchet distance. The approach is the same as with the weak Fréchet distance, except that we also need to bound the VC dimension of range spaces associated with predicates $P_7$ and $P_8$ to encode monotonicity. For that, we can simply appeal to Theorem 6.3.
We need to define a set to represent predicates $P_7$ and $P_8$. To this end, we again use $\mathbb{X}^d_2$ to represent the set of all segments in $\mathbb{R}^d$. Given a radius $r \ge 0$ and a line segment $\overline{st}$, let $\ell$ be the line supporting $\overline{st}$. We define $M_r(\overline{st})$ to be the set containing all pairs of points $(q_1, q_2)$ for which there exist $p_1, p_2 \in \ell$ such that $\|p_1 - q_1\| \le r$, $\|p_2 - q_2\| \le r$, and $p_1$ appears before $p_2$ along $\ell$. The predicate $P_7$ is satisfied if and only if $(s_j, s_t) \in M_r(\overline{q_i q_{i+1}})$, and predicate $P_8$ is satisfied if and only if $(q_i, q_t) \in M_r(\overline{s_j s_{j+1}})$. Finally, we define $\mathcal{M} = \{M_r(\overline{st}) \mid \overline{st} \in \mathbb{X}^d_2, r \ge 0\}$ to be the set of all relevant ranges.
Proof The corollary directly follows from Lemma 7.4 by collapsing the stadiums to circles.
We define sets corresponding to predicates $P_7$ and $P_8$. Because this bound is proven using Theorem 6.3, it also applies to the dual range space, so the number of possible sets $\bigcup_{r \ge 0} \bigcup_{s \in S} P^r_8(q, s)$ is also bounded by $(tm)^{O(d^2 k^2)}$. The $k^2$ term arises because we consider $\Theta(k^2)$ pairs $q_i, q_t$ for predicate $P_8$. Similarly, we can bound the shattering dimension $\delta$.

Lower Bounds
Our lower bounds are constructed in the simplified setting that either $k = 1$ or $m = 1$, i.e., either the ground set or the curves defining the metric ball consist of one vertex only. In this case, all of our considered distance measures (except for one direction of the directed Hausdorff distance) are equal: Let $d_{dH}(p, q)$ be the Hausdorff distance between $V(p)$ and $V(q)$. It holds that
Proof In the discrete case we interpret $q \in \mathbb{X}^d_k$ as an ordered or unordered sequence of points in $\mathbb{R}^d$. In this case, the proof follows directly from the definitions (Sect. 2). In the continuous case we interpret $q \in \mathbb{X}^d_k$ as a continuous polygonal curve. In this case, the proof follows directly from the definitions and from the convexity of the Euclidean ball of radius $r$ centered at the point $p$: the distance is at most $r$ if and only if all vertices of $q$ are contained in this ball.
Because of the above lemma, any lower bound that we prove for the Hausdorff distance in the discrete setting automatically extends to the other distance measures.
The intuition of our proof is as follows. We construct a set of $k$ points in $\mathbb{R}^2$ that can be shattered by the ranges in $\mathcal{R}_{dH,k}$. The basic idea is that the ranges behave like convex polygons with $k$ facets. In particular, the set of points contained inside the range centered at a curve $q$ is equal to the intersection of a set of equal-size Euclidean balls centered at the vertices of $q$.
Concretely, we place a set $P$ of $k \ge 4$ points evenly spaced on a unit circle centered at the origin, see Fig. 6. Let $R > 2$ be a parameter of the construction. To represent any subset of $P$, we construct $q$ using $k$ vertices (in any order) placed on the origin-centered circle of radius $R - 1$. In particular, we can force any $p_0 \in P$ to be excluded from the metric ball under the Hausdorff distance of a fixed radius by placing a vertex on the line through the origin that contains $p_0$ and by adding this vertex to the vertex set of $q$. Using the $k$ vertices of $q$, we can specifically exclude any subset of up to $k$ points from $P$ by such a construction, and a vertex of $q$ placed at the origin does not exclude any points. Hence any such set $P$ of size $k$ on the unit circle can be shattered.
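The construction can be checked numerically with a short Python sketch. Several details below are assumptions made for the illustration and are not taken from the text: each blocking vertex is placed at $-(R-1)\,p_0$, i.e., on the far side of the origin from the point $p_0$ it excludes, so that it is at distance exactly $R$ from $p_0$; the fixed radius is chosen halfway between $R$ and the distance from a blocking vertex to the nearest other point of $P$; and unused vertices are placed at the origin. All function names are hypothetical.

```python
import itertools
import math


def in_range(p, q, r):
    """True iff every vertex of q is within distance r of the point p, i.e.
    p lies in the intersection of the r-balls around the vertices of q."""
    return all(math.dist(p, v) <= r for v in q)


def shattering_construction(k, R=3.0):
    """Return (points, radius, center_for) for the circle construction."""
    assert k >= 4 and R > 2
    points = [(math.cos(2 * math.pi * i / k), math.sin(2 * math.pi * i / k))
              for i in range(k)]
    # Distance from a blocking vertex to the closest point it must NOT exclude.
    neighbor = math.sqrt(1 + (R - 1) ** 2 + 2 * (R - 1) * math.cos(2 * math.pi / k))
    radius = (neighbor + R) / 2  # assumed choice, strictly between the two

    def center_for(excluded):
        # One vertex opposite each excluded point, at distance R - 1 from the origin.
        q = [(-(R - 1) * points[i][0], -(R - 1) * points[i][1]) for i in excluded]
        q += [(0.0, 0.0)] * (k - len(q))  # unused vertices sit at the origin
        return q

    return points, radius, center_for


def is_shattered(k, R=3.0):
    """Check that every subset of the k points is cut out by some range."""
    points, radius, center_for = shattering_construction(k, R)
    for mask in itertools.product([False, True], repeat=k):
        excluded = [i for i in range(k) if mask[i]]
        q = center_for(excluded)
        inside = [in_range(p, q, radius) for p in points]
        if inside != [not flag for flag in mask]:
            return False
    return True
```

The check enumerates all $2^k$ subsets, so it is only meant for small $k$.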
Proof Lemma 9.2 and [30, Lem. 5.18], which bounds the VC dimension of the dual range space as a function of the VC dimension of the primal space, imply the theorem.
, log m)).
Proof It follows by applying Lemmas 9.2 and 9.3 together with Lemma 9.1.
Proof As in the proof of Lemma 9.2, our construction is set in the simplified setting where $m = 1$, i.e., the ground set corresponds to points in $\mathbb{R}^d$. We now show the theorem by reducing it to a recent lower bound of Csikós et al. [18], which is $\Omega(dk \log k)$ for a related range space for $d \ge 4$. This range space is defined on a ground set $P \subseteq \mathbb{R}^d$ with ranges $\mathcal{R}_k$, where each range $R \in \mathcal{R}_k$ is the intersection of $k$ halfspaces.

Theorem 9.4 The VC dimension of the range spaces
Recall that the construction in the proof of Lemma 9.2 used the fact that for $d = 2$ the ranges behave like convex polygons. We can observe a similar behavior in higher dimensions. In particular, Lemma 9.1 implies that any range in $\mathcal{R}_{dH,k}$ corresponds to the intersection of $k$ balls in $\mathbb{R}^d$ (centered at the vertices of $q$). Observe that for a sufficiently large fixed radius $R$, for any point set $P \subseteq \mathbb{R}^d$, and for any halfspace $H$, we can find a ball of radius $R$ which has the same inclusion properties as $H$. Finally, the lower bound of Csikós et al. [18] shows that there exists a set $P$ of $\kappa = \Omega(dk \log k)$ points which can be shattered by such ranges.
(dk log k, log dm)).
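The observation that a large ball can imitate a halfspace on a finite point set can be illustrated with the following Python sketch. It is a minimal example under stated assumptions: the point set is finite, no point lies exactly on the bounding hyperplane, and the radius is computed per halfspace (a single fixed $R$ for finitely many halfspaces is then obtained by taking the maximum, since enlarging the radius preserves the property). The function name and the representation of the halfspace as $\{x : \langle n, x \rangle \le c\}$ are choices made for this illustration.

```python
import math


def ball_matching_halfspace(points, normal, offset):
    """Return (center, radius) of a ball that contains exactly those points
    of `points` lying in the halfspace {x : <normal, x> <= offset}.

    Assumes that no point lies exactly on the bounding hyperplane."""
    norm = math.sqrt(sum(c * c for c in normal))
    unit = [c / norm for c in normal]
    level = offset / norm
    anchor = [level * u for u in unit]  # a point on the bounding hyperplane
    radius = 1.0
    for p in points:
        depth = level - sum(u * x for u, x in zip(unit, p))
        if depth == 0:
            raise ValueError("a point lies on the bounding hyperplane")
        if depth > 0:
            # p is inside the ball tangent at `anchor` iff
            # |p - anchor|^2 <= 2 * radius * depth.
            radius = max(radius, math.dist(p, anchor) ** 2 / (2 * depth))
    radius *= 2.0  # slack against rounding; larger radii remain valid
    center = [a - radius * u for a, u in zip(anchor, unit)]
    return center, radius
```

Points outside the halfspace are automatically outside the ball, because the ball touches the hyperplane at `anchor` and lies entirely on the inner side.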
Proof It follows by applying Lemmas 9.5 and 9.6 together with Lemma 9.1.

Implications
In this section we demonstrate that bounds on the VC dimension for the range space defined by metric balls on curves immediately imply various results about prediction and statistical generalization over the space of curves. In the following, consider a range space $(X, \mathcal{R})$ with a ground set $X$ of curves, where $\mathcal{R}$ is the set of ranges corresponding to metric balls for some distance measure we consider, and the VC dimension is bounded by $\nu$. This section discusses accuracy bounds that depend directly on the size $n = |X|$ and the VC dimension $\nu$. We will assume that $X$ is a random sample of some much larger set $X_{\mathrm{big}}$ or of an unknown continuous generating distribution $\mu$. Under the randomness in this assumed sampling procedure, there is a probability of failure $\delta$ that often shows up in these bounds, but its effect is minor since it enters only as $\log(1/\delta)$. The following discussion leverages the concepts of $\varepsilon$-nets and $\varepsilon$-samples. The former ($\varepsilon$-nets) are samples with the property that if a range is heavy (contains an $\varepsilon$-fraction of the data) then the sample contains at least one point in that range; a sample of size $O((\nu/\varepsilon) \log(\nu/(\varepsilon\delta)))$ is sufficient [33]. The latter ($\varepsilon$-samples) are samples in which each range's density is approximated within an additive $\varepsilon$-error; a sample of size $O((1/\varepsilon^2)(\nu + \log(1/\delta)))$ is sufficient [39].
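To get a feeling for how these two bounds scale, the following Python sketch evaluates the expressions inside the $O(\cdot)$ notation. The leading constant `c` is a placeholder assumption (the asymptotic statements do not fix it), so the returned numbers indicate growth rates rather than guaranteed sample sizes; the function names are hypothetical.

```python
import math


def eps_net_size(nu, eps, delta, c=1.0):
    """Evaluate c * (nu / eps) * log(nu / (eps * delta)), rounded up.

    `c` stands in for the unspecified constant hidden in the O-notation."""
    return math.ceil(c * (nu / eps) * math.log(nu / (eps * delta)))


def eps_sample_size(nu, eps, delta, c=1.0):
    """Evaluate c * (nu + log(1 / delta)) / eps^2, rounded up.

    `c` stands in for the unspecified constant hidden in the O-notation."""
    return math.ceil(c * (nu + math.log(1.0 / delta)) / eps ** 2)
```

For example, with $\nu = 10$, $\varepsilon = 0.1$ and $\delta = 0.05$, and the placeholder constant $c = 1$, the two expressions evaluate to 761 and 1300 respectively. Halving $\varepsilon$ roughly doubles the first and quadruples the second, while $\delta$ only contributes logarithmically.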
These bounds often take two closely linked forms. First, given a limited set $X$ drawn from an unknown $\mu$, how accurate is a query or a prediction made using only $X$? Second, given the ability to draw samples (at a cost) from an unknown distribution $\mu$, how many are required so that the prediction on the set of samples $X$ has bounded prediction error? Upper bounds on $\nu$ imply pessimistic bounds on the accuracy or on the required size of a sample.
Large data sets of curves are now commonplace in many structured data applications. Examples include the millions of ride-sharing trips taken every day, the GPS traces that Apple, Google, and others collect on users' phones, and the tracking of migrating animals. Because this data has a complex structure, and each associated curve may be large (i.e., $m$ is large), it is not clear how well analyses on families of such curves can provably generalize to predict new data. The theme of the following results, as implied by our above VC dimension results, is that if these families of curves are only inspected or queried with curves with a small number of segments (i.e., $k$ is small), then the VC dimension of the associated range space, $\nu = O(k \log(km))$ or $O(k^2 \log(km))$, is small, and such analyses generalize well. We show this in several concrete examples.
Approximate range counting on curves. Given a large set of curves $X$ (of potentially very large complexity $m$) and a query curve $q$ (with smaller complexity $k$), we would like to approximate the number of curves near $q$. For instance, we restrict $X$ to historical queries at a certain time of day and query with the planned route $q$, and would like to know the chance of finding a carpool. The VC dimension $\nu$ of the metric balls shows up directly in two analyses. First, if we assume that $X$ has been chosen from an unknown distribution, i.e., $X \sim \mu$ where $\mu$ is the true but unknown underlying distribution, then we can estimate the fraction of all curves in this range within additive error $O(\sqrt{(1/|X|)(\nu + \log(1/\delta))})$. On the other hand, if we assume that $X$ is a fixed input set which is too large to conveniently query, we can sample a subset $S \subset X$ of size $O((1/\varepsilon^2)(\nu + \log(1/\delta)))$ and know that the estimate for the fraction of curves from $S$ in that range is within additive $\varepsilon$ error of the fraction from $X$. Such sampling techniques have a long history in traditional databases [40], and have more recently become important when providing online estimates during a long query processing time as subsets of incrementally increasing size are considered [3]. Our work provides the first formal analysis of these results for queries over curves. Moreover, the finite bound on the VC dimension of these problems also implies [17] that there is a linear-size data structure which can answer exact range queries in sublinear time.
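The sampling-based estimate can be sketched in Python as follows. This is an illustration under assumptions rather than a prescribed implementation: it uses the discrete Fréchet distance (one of the measures covered by the bounds) computed with the standard dynamic program, curves are lists of coordinate tuples, and the function names and the `rng` parameter are hypothetical.

```python
import math
import random


def discrete_frechet(p, q):
    """Discrete Fréchet distance between vertex sequences p and q."""
    n, m = len(p), len(q)
    table = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                table[i][j] = d
            elif i == 0:
                table[i][j] = max(table[0][j - 1], d)
            elif j == 0:
                table[i][j] = max(table[i - 1][0], d)
            else:
                best_previous = min(table[i - 1][j], table[i][j - 1],
                                    table[i - 1][j - 1])
                table[i][j] = max(best_previous, d)
    return table[n - 1][m - 1]


def estimate_ball_fraction(curves, q, r, sample_size, rng):
    """Estimate the fraction of `curves` within distance r of q from a
    uniform random sample (without replacement) of the given size."""
    size = min(sample_size, len(curves))
    sample = rng.sample(curves, size)
    hits = sum(1 for c in sample if discrete_frechet(c, q) <= r)
    return hits / size
```

A call such as `estimate_ball_fraction(curves, q, r, 1300, random.Random(0))` would then approximate the relative size of the range; multiplying by `len(curves)` gives an approximate count.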
Density estimation of curves. A related task in generalization to new curves is density estimation. Consider a large set of curves X which represent a larger unknown distribution μ that models a distribution of curves; we want to understand how unusual a new curve q would be, given we have not yet seen exactly the same curve before. One option is to use the distance to the (kth) nearest neighbor curve in X , or a bit more robust option is to choose a radius r and count how many curves are within that radius (e.g., the approximate range counting results above).
Alternatively, for $X \subset M$, consider now a kernel density estimate $\mathrm{kde}_X : M \to \mathbb{R}$ defined by $\mathrm{kde}_X(x) = \frac{1}{n} \sum_{p \in X} K(x, p)$ with kernel $K(x, p) = \exp(-d(x, p)^2)$ (where $d$ is some distance of choice among curves, e.g., $d_F$). The kernel is defined so that each superlevel set $K^\tau_x = \{p \in M \mid K(x, p) \ge \tau\}$ corresponds to some range $R \in \mathcal{R}$ such that $R \cap X = K^\tau_x \cap X$. Then a random sample $S \subset X$ of size $O((1/\varepsilon^2)(\nu + \log(1/\delta)))$ satisfies $\|\mathrm{kde}_X - \mathrm{kde}_S\|_\infty \le \varepsilon$ [35]. Thus, again the VC dimension $\nu$ of the metric balls directly influences the accuracy of these estimates, and for query curves with small complexity $k$ the bound is quite reasonable.
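A minimal Python sketch of this kernel density estimate is given below. The distance is passed in as a function so that any of the curve distances can be plugged in; the function names, the `dist` and `rng` parameters, and the use of sampling without replacement are assumptions made for this illustration.

```python
import math
import random


def kde(x, data, dist):
    """Kernel density estimate at x with kernel exp(-dist(x, p)^2),
    averaged over all objects p in `data`."""
    return sum(math.exp(-dist(x, p) ** 2) for p in data) / len(data)


def kde_from_sample(x, data, dist, sample_size, rng):
    """The same estimate computed on a uniform random sample of `data`."""
    size = min(sample_size, len(data))
    return kde(x, rng.sample(data, size), dist)
```

For curves, `dist` could be, for instance, an implementation of the discrete Fréchet distance; the sample-based value then approximates the full estimate up to the additive error discussed above.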
Sample complexity for classification of curves. Now consider the problem of classifying curves representing trajectories of people or animals. For instance, individuals who enable GPS on their cell phones can label some trajectories as work-to-home trips ($\chi(x) = +1$) or as other trips ($\chi(x) = -1$). Then on unlabeled trips we can potentially predict which are work-to-home trajectories, in order to build traffic and commute time models without manually labeling all routes. Similar tasks may be useful for distinguishing normal ($\chi(x) = +1$) from nefarious ($\chi(x) = -1$) activities when tracking people in an airport or a hostile zone. In each of these cases we may either have a very large number of labeled instances, and may want to sample them to some manageable size, or we may only have a limited number of samples, and want to know how much accuracy to trust based on the sample size. All of these bounds are controlled by the VC dimension of the family of classifiers used to make the prediction. For trajectories, a sensible family of classifiers would be the ranges $\mathcal{R}$ defined by metric balls.
That is, consider some labeling function $\chi : X \to \{-1, +1\}$; we say a range $R \in \mathcal{R}$ misclassifies an object $x \in X$ if $x \in R$ and $\chi(x) = -1$, or $x \notin R$ and $\chi(x) = +1$. If there exists a range $R \in \mathcal{R}$ such that all $x \in X \cap R$ have $\chi(x) = +1$ and all $x \in X \setminus R$ have $\chi(x) = -1$, we say such a range perfectly separates $(X, \chi)$. Then a random sample $Y \subset X$ of size $O((\nu/\varepsilon) \log(\nu/(\varepsilon\delta)))$ [33] ensures that, with probability at least $1 - \delta$, any range $R \in \mathcal{R}$ which perfectly separates $(Y, \chi)$ misclassifies at most $\varepsilon n$ points in $X$.
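The definitions of misclassification and perfect separation translate directly into code. The following Python sketch treats a range as a metric ball given by a center, a radius, and a distance function; all names are hypothetical, and the distance is a parameter so that any of the curve distances can be used.

```python
def misclassified_count(data, labels, center, radius, dist):
    """Number of objects misclassified by the ball of the given radius
    around `center`: inside with label -1, or outside with label +1."""
    errors = 0
    for x, label in zip(data, labels):
        inside = dist(x, center) <= radius
        if (inside and label == -1) or (not inside and label == +1):
            errors += 1
    return errors


def perfectly_separates(data, labels, center, radius, dist):
    """True iff the ball contains exactly the objects labeled +1."""
    return misclassified_count(data, labels, center, radius, dist) == 0
```

Evaluating `misclassified_count` on a random sample $Y$ and on the full set $X$ gives the two quantities that the bound above relates.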
Consider a random sample $Y \subset X$ of size $O((1/\varepsilon^2)(\nu + \log(1/\delta)))$. For any range $R \in \mathcal{R}$, if the fraction of points in $Y$ is $|R \cap Y|/|Y| = \eta$, then with probability at least $1 - \delta$, the fraction of points in $X$ is $|R \cap X|/|X| \in [\eta - \varepsilon, \eta + \varepsilon]$; that is, it is off by at most an $\varepsilon$-fraction [31, 39]. If there is a labeling $\chi : X \to \{-1, +1\}$, this notably includes the range $R \in \mathcal{R}$ which misclassifies the fewest points (there may not be a perfect separator). Hence random sampling ensures that at most an $\varepsilon$-fraction more misclassified points are included in an estimate derived from this sample. Indeed, the RBF kernel $K(x, p) = \exp(-d(x, p)^2)$ defined above implies that standard mechanisms such as kernel SVM or the kernel perceptron [42] can be used to build classifiers, and together these bounds induce misclassification [39] and margin approximation bounds [35]. The small VC dimension $\nu$ implies they will generalize well.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.