Sample-based distance-approximation for subsequence-freeness

In this work, we study the problem of approximating the distance to subsequence-freeness in the sample-based distribution-free model. For a given subsequence (word) $w = w_1 \dots w_k$, a sequence (text) $T = t_1 \dots t_n$ is said to contain $w$ if there exist indices $1 \leq i_1<\dots<i_k \leq n$ such that $t_{i_{j}} = w_j$ for every $1 \leq j \leq k$. Otherwise, $T$ is $w$-free. Ron and Rosin (ACM TOCT 2022) showed that the number of samples both necessary and sufficient for one-sided error testing of subsequence-freeness in the sample-based distribution-free model is $\Theta(k/\epsilon)$. Denoting by $\Delta(T,w,p)$ the distance of $T$ to $w$-freeness under a distribution $p :[n]\to [0,1]$, we are interested in obtaining an estimate $\widehat{\Delta}$, such that $|\widehat{\Delta} - \Delta(T,w,p)| \leq \delta$ with probability at least $2/3$, for a given distance parameter $\delta$. Our main result is an algorithm whose sample complexity is $\tilde{O}(k^2/\delta^2)$. We first present an algorithm that works when the underlying distribution $p$ is uniform, and then show how it can be modified to work for any (unknown) distribution $p$. We also show that a quadratic dependence on $1/\delta$ is necessary.


Introduction
Distance approximation algorithms, as defined in [28], are sublinear algorithms that approximate (with constant success probability) the distance of objects from satisfying a prespecified property $P$. Distance approximation (and the closely related notion of tolerant testing) is an extension of property testing [30,19], where the goal is to distinguish between objects that satisfy a property $P$ and those that are far from satisfying the property. In this work we consider the property of subsequence-freeness. For a given subsequence (word) $w = w_1 \dots w_k$ over some alphabet $\Sigma$, a sequence (text) $T = t_1 \dots t_n$ over $\Sigma$ is said to be $w$-free if there do not exist indices $1 \leq j_1 < \dots < j_k \leq n$ such that $t_{j_i} = w_i$ for every $i \in [k]$. In most previous works on property testing and distance approximation, the algorithm is allowed query access to the object, and the distance to satisfying the property in question, $P$, is defined as the minimum Hamming distance to an object that satisfies $P$, normalized by the size of the object. In this work we consider the more challenging, and sometimes more suitable, sample-based model, in which the algorithm is only given a random sample from the object. In particular, when the object is a sequence $T = t_1 \dots t_n$, each element in the sample is a pair $(j, t_j)$.
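To make the definition concrete, here is a small Python sketch (the function name is ours, not from the paper) that checks whether a text contains a given word as a subsequence; a text is $w$-free exactly when this check returns False:

```python
def contains_subsequence(text, word):
    """Return True if `word` occurs in `text` as a (not necessarily
    contiguous) subsequence, i.e., if `text` is not `word`-free."""
    it = iter(text)
    # Greedily match each symbol of `word` to its earliest possible
    # occurrence in the remaining suffix of `text`.
    return all(any(t == w for t in it) for w in word)
```

The greedy left-to-right matching is sound here because matching each symbol as early as possible never rules out a later match.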
We study both the case in which the underlying distribution according to which each index $j$ is selected (independently) is the uniform distribution over $[n]$, and the more general case in which the underlying distribution is some arbitrary unknown $p : [n] \to [0,1]$. We refer to the former as the uniform sample-based model, and to the latter as the distribution-free sample-based model. The distance (to satisfying the property) is determined by the underlying distribution. Namely, it is the minimum total weight according to $p$ of indices $j$ such that $t_j$ must be modified so as to make the sequence $w$-free. Hence, in the uniform sample-based model, the distance measure is simply the Hamming distance normalized by $n$.
The related problem of testing the property of subsequence-freeness in the distribution-free sample-based model was studied by Ron and Rosin [29]. They showed that the sample complexity of one-sided error testing of subsequence-freeness in this model is $\Theta(k/\epsilon)$ (where $\epsilon$ is the given distance parameter). A natural question is whether we can design a sublinear algorithm, with small sample complexity, that actually approximates the distance of a text $T$ to $w$-freeness. It is worth noting that, in general, tolerant testing (and hence distance approximation) for a property may be much harder than testing the property [17,3].

Our results
In what follows, when we say that a sample is selected uniformly from $T$, we mean that for each sample point $(j, t_j)$, the index $j$ is selected uniformly and independently from $[n]$. This generalizes to the case in which the underlying distribution is an arbitrary distribution $p$.
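This sampling model can be sketched as follows (a hypothetical helper using Python's standard library, not part of the paper's algorithms): each sample point is an index drawn independently, uniformly or according to a weight vector $p$, together with the symbol at that index.

```python
import random

def draw_sample(text, s, p=None):
    """Draw s labeled sample points (j, t_j), where each index j is chosen
    independently from {1, ..., n}: uniformly if p is None, and otherwise
    according to the weights p (p[j-1] is the weight of index j)."""
    n = len(text)
    if p is None:
        indices = [random.randrange(1, n + 1) for _ in range(s)]
    else:
        indices = random.choices(range(1, n + 1), weights=p, k=s)
    return [(j, text[j - 1]) for j in indices]
```

The uniform sample-based model is recovered by `p=None`, and the distribution-free model by passing an arbitrary (to the algorithm, unknown) weight vector.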
We start by designing a distance-approximation algorithm in the uniform sample-based model. Let $\Delta(T,w)$ denote the distance under the uniform distribution of $T$ from being $w$-free (which equals the fraction of symbols in $T$ that must be modified so as to obtain a $w$-free text), and let $\delta \in (0,1)$ denote the error parameter given to the algorithm.

A high-level discussion of our algorithms
Our starting point is a structural characterization of the distance to $w$-freeness under the uniform distribution, which is proved in [29, Sec. 3.1]. In order to state their characterization, we introduce the notion of copies of $w$ in $T$, and more specifically, role-disjoint copies.
A copy of $w = w_1 \dots w_k$ in $T = t_1 \dots t_n$ is a sequence of indices $(j_1, \dots, j_k)$ such that $1 \leq j_1 < \dots < j_k \leq n$ and $t_{j_1} \dots t_{j_k} = w$. It will be convenient to represent a copy as an array $C$ of size $k$ where $C[i] = j_i$. A set of copies $\{C_\ell\}$ is said to be role-disjoint if for every $i \in [k]$, the indices in $\{C_\ell[i]\}_\ell$ are distinct (though it is possible that $C_\ell[i] = C_{\ell'}[i']$ for $i \neq i'$ and $\ell \neq \ell'$). In the special case where the symbols of $w$ are all different from each other, a set of copies is role-disjoint simply if it consists of disjoint copies. Ron and Rosin prove [29, Theorem 3.4 + Claim 3.1] that $\Delta(T,w)$ equals the maximum number of role-disjoint copies of $w$ in $T$, divided by $n$.
Note that the analysis of the sample complexity of one-sided error sample-based testing of subsequence-freeness translates to bounding the size of the sample that is sufficient and necessary for ensuring that the sample contains evidence that $T$ is not $w$-free when $\Delta(T,w) > \epsilon$. Here evidence is in the form of a copy of $w$ in the sample, so that the testing algorithm simply checks whether such a copy exists. On the other hand, the question of distance approximation has a more algorithmic flavor, as the problem itself does not determine what must be done by the algorithm given a sample.
Focusing first on the uniform case, Ron and Rosin used their characterization (more precisely, the direction by which if $\Delta(T,w) > \epsilon$, then $T$ contains more than $\epsilon n$ role-disjoint copies of $w$) to prove that a sample of size $\Theta(k/\epsilon)$ contains at least one copy of $w$ with probability at least $2/3$.
In this work we go further by designing an algorithm that actually approximates the number of role-disjoint copies of $w$ in $T$ (and hence approximates $\Delta(T,w)$), given a uniformly selected sample from $T$. It is worth noting that the probability of obtaining a copy in the sample might be quite different for texts that have exactly the same number of role-disjoint copies of $w$ (and hence the same distance to being $w$-free). In the next subsection we discuss the aforementioned algorithm (for the uniform case), and in the one after it we address the distribution-free case.

The uniform case
Let $R(T,w)$ denote the number of role-disjoint copies of $w$ in $T$. In a nutshell, the algorithm works by computing estimates of the numbers of occurrences of symbols of $w$ in a relatively small number of prefixes of $T$, and using them to derive an estimate of $R(T,w)$. The more precise description of the algorithm and its analysis are based on several combinatorial claims that we present and discuss next.
Let $R^j_i(T,w)$ denote the number of role-disjoint copies of the length-$i$ prefix of $w$, $w_1 \dots w_i$, in the length-$j$ prefix of $T$, $t_1 \dots t_j$, and let $N^j_i(T,w)$ denote the number of occurrences of the symbol $w_i$ in $t_1 \dots t_j$. In our first combinatorial claim, we show that for every $i \in [k]$ and $j \in [n]$, the value of $R^j_i(T,w)$ can be expressed in terms of the values of $N^{j'}_i(T,w)$ for $j' \in [j]$ (in particular, $N^j_i(T,w)$) and the values of $R^{j'-1}_{i-1}(T,w)$ for $j' \in [j]$. In other words, we establish a recursive expression which implies that if we know what $R^{j'-1}_{i-1}(T,w)$ and $N^{j'}_i(T,w)$ are for every $j' \in [j]$, then we can compute $R^j_i(T,w)$ (and, as an end result, compute $R(T,w) = R^n_k(T,w)$). In our second combinatorial claim we show that if we only want an approximation of $R(T,w)$, then it suffices to define (also in a recursive manner) a measure that depends on the values of $N^j_i(T,w)$ for every $i \in [k]$, but only for a relatively small number of evenly spaced choices of $j$. To be precise, each such $j$ belongs to the set $J = \{r \cdot \gamma n\}_{r=1}^{1/\gamma}$ for $\gamma = \Theta(\delta/k)$. We prove that since each interval $[(r-1)\gamma n + 1, r\gamma n]$ is of size $\gamma n$ for this choice of $\gamma$, we can ensure that the aforementioned measure (which uses only $j \in J$) approximates $R(T,w)$ to within $O(\delta n)$.
We then prove that if we replace each $N^j_i(T,w)$ for these choices of $j$ (and for every $i \in [k]$) by a sufficiently good estimate, then we incur only a bounded error in the approximation of $R(T,w)$. Finally, such estimates are obtained using (uniform) sampling, with a sample of size $\tilde{O}(k^2/\delta^2)$.

The distribution-free case
In [29, Sec. 4] it is shown that, given a word $w$, a text $T$ and a distribution $p$, it is possible to define a word $\widetilde{w}$ and a text $\widetilde{T}$ for which the following holds. First, $\Delta(T,w,p)$ is closely related to $\Delta(\widetilde{T}, \widetilde{w})$. Second, the probability of observing a copy of $w$ in a sample selected from $T$ according to $p$ is closely related to the probability of observing a copy of $\widetilde{w}$ in a sample selected uniformly from $\widetilde{T}$.
We use the first relation stated above (i.e., between $\Delta(T,w,p)$ and $\Delta(\widetilde{T}, \widetilde{w})$). However, since we are interested in distance approximation rather than one-sided error testing, the second relation stated above (between the probability of observing a copy of $w$ in $T$ and that of observing a copy of $\widetilde{w}$ in $\widetilde{T}$) is not sufficient for our needs, and we need to take a different (once again, more algorithmic) path, as we explain next.
Ideally, we would have liked to sample uniformly from $\widetilde{T}$, and then run the algorithm discussed in the previous subsection using this sample (and $\widetilde{w}$). However, we only have sampling access to $T$ according to the underlying distribution $p$, and we do not have direct sampling access to uniform samples from $\widetilde{T}$. Furthermore, since $\widetilde{T}$ is defined based on (the unknown) $p$, it is not clear how to determine the aforementioned subset of (evenly spaced) indices $J$.
For the sake of clarity, we continue the current exposition under two assumptions. The first is that the distribution $p$ is such that there exists a value $\beta$ for which $p_j/\beta$ is an integer for every $j \in [n]$ (the value of $\beta$ need not be known). The second is that in $w$ there are no two consecutive symbols that are the same. Under these assumptions, $\widetilde{T} = t_1^{p_1/\beta} \dots t_n^{p_n/\beta}$, $\widetilde{w} = w$, and $\Delta(\widetilde{T}, \widetilde{w}) = \Delta(T, w, p)$ (where $t_j^x$ for an integer $x$ is the subsequence that consists of $x$ repetitions of $t_j$).
Our algorithm for the distribution-free case (working under the aforementioned assumptions) starts by taking a sample distributed according to $p$ and using it to select a (relatively small) subset of indices in $[n]$. Denoting these indices by $b_1 < \dots < b_\ell$ (with $b_0 = 0$), we would have liked to ensure that the weight according to $p$ of each interval $[b_{u-1}+1, b_u]$ is approximately the same (as is the case when considering the intervals defined by the subset $J$ in the uniform case). To be precise, we would have liked each interval to have relatively small weight, while the total number of intervals is not too large. However, since it is possible that for some single indices $j \in [n]$ the probability $p_j$ is large, we also allow intervals with large weight, where these intervals consist of a single index (and there are few of them).
The algorithm next takes an additional sample, to approximate, for each $i \in [k]$ and $u \in [\ell]$, the weight according to $p$ of the occurrences of the symbol $w_i$ in the length-$b_u$ prefix of $T$. Observe that prefixes of $T$ correspond to prefixes of $\widetilde{T}$. Furthermore, the weight according to $p$ of occurrences of symbols in such prefixes translates to numbers of occurrences of symbols in the corresponding prefixes of $\widetilde{T}$, normalized by the length of $\widetilde{T}$. The algorithm then uses these approximations to obtain an estimate of $\Delta(\widetilde{T}, \widetilde{w})$.
We note that some pairs of consecutive prefixes of $\widetilde{T}$ might be far apart, as opposed to what we had in the algorithm for the uniform case described in Section 1.2.1. However, this is always due to single-index intervals in $T$ (for $j$ such that $p_j$ is large). Each such interval corresponds to a consecutive subsequence of $\widetilde{T}$ consisting of repetitions of the same symbol, and we show that no additional error is incurred because of such intervals.

Related results
As we have previously mentioned, the work most closely related to ours is that of Ron and Rosin on distribution-free sample-based testing of subsequence-freeness [29]. For other related results on property testing (e.g., testing other properties of sequences, sample-based testing of other types of properties, and distribution-free testing (possibly with queries)), see the introduction of [29], and in particular Section 1.4. For another line of work, on sublinear approximation of the longest increasing subsequence, see [26] and references within. Here we briefly discuss related results on distance approximation / tolerant testing.
As already noted, distance approximation and tolerant testing were first formally defined in [28], and were shown to be significantly harder for some properties in [17,3]. Almost all previous results are query-based, with the distance measure defined with respect to the uniform distribution. These include [20,18,1,25,15,11,22,7,24,16,27]. Kopparty and Saraf [23] present results for query-based tolerant testing of linearity under several families of distributions. Berman, Raskhodnikova and Yaroslavtsev [5] give tolerant (query-based) $L_p$-testing algorithms for monotonicity. Berman, Murzbulatov and Raskhodnikova [4] give a sample-based distance-approximation algorithm for image properties that works under the uniform distribution.
Canonne et al. [12] study the property of $k$-monotonicity of Boolean functions over various posets. A Boolean function over a finite poset domain $D$ is $k$-monotone if it alternates between the values 0 and 1 at most $k$ times on any ascending chain in $D$. For the special case of $D = [n]$, the property of $k$-monotonicity is equivalent to being free of the two alternating words $w$ of length $k+2$, where $w_1 \in \{0,1\}$ and $w_{i+1} = 1 - w_i$ for every $i \in [k+1]$. One of their results implies an upper bound of $O(k/\delta^3)$ on the sample complexity of distance approximation for $k$-monotonicity of functions $f : [n] \to \{0,1\}$ under the uniform distribution (and hence for $w$-freeness when $w$ is a binary subsequence of this specific form). This result generalizes to $k$-monotonicity in higher dimensions (at an exponential cost in the dimension $d$).
Blum and Hu [9] study distance approximation for $k$-interval (Boolean) functions over the line in the distribution-free active setting. In this setting, an algorithm gets an unlabeled sample from the domain of the function, and asks queries on a subset of the sample points. Focusing on the sample complexity, they show that for any underlying distribution $p$ on the line, a sample of size $O(k/\delta^2)$ is sufficient for approximating the distance to being a $k$-interval function up to an additive error of $\delta$. This implies a sample-based distribution-free distance-approximation algorithm with the same sample complexity for the special case of being free of the same pair of $w$'s described in the previous paragraph, with $k+2$ replaced by $k+1$.
Blais, Ferreira Pinto Jr. and Harms [8] introduce a variant of the VC-dimension and use it to prove lower and upper bounds on the sample complexity of distribution-free testing for a variety of properties.In particular, one of their results implies that the linear dependence on k in the result of [9] is essentially optimal.
Finally, we mention that our procedure in the distribution-free case for constructing "almost-equal-weight" intervals by sampling is somewhat reminiscent of techniques used in other contexts of testing when dealing with non-uniform distributions [6,21,10].

Further research
The main open problem left by this work is closing the gap between the upper and lower bounds that we give, and in particular understanding the precise dependence on $k$, or possibly other parameters determined by $w$ (such as $k_d$). One step in this direction can be found in the master's thesis of the first author [13].

Organization
In Section 2 we present our algorithm for distance approximation under the uniform distribution. The algorithm for the distribution-free case appears in Section 3. In Section 4 we prove our lower bound. In the appendix we provide Chernoff bounds and a few proofs of technical claims.

Distance approximation under the uniform distribution
In this section, we address the problem of distance approximation when the underlying distribution is the uniform distribution. As mentioned in the introduction, Ron and Rosin showed [29, Thm. 3.4] that $\Delta(T,w)$ (the distance of $T$ from $w$-freeness under the uniform distribution) equals the maximum number of role-disjoint copies of $w$ in $T$, divided by $n = |T|$ (where role-disjoint copies are as defined in the introduction, see Section 1.2). We may use $T[j]$ to denote the $j$-th symbol of $T$ (so that $T[j] = t_j$).
We start by introducing the following notation. Let $R^j_i(T,w)$ denote the number of role-disjoint copies of the subsequence $w_1 \dots w_i$ in $T[1,j]$. When $i = k$ and $j = n$, we use the shorthand $R(T,w)$ for $R^n_k(T,w)$ (the total number of role-disjoint copies of $w$ in $T$).
Observe that $R^j_1(T,w)$ equals $N^j_1(T,w)$ for every $j \in [n]$. Since, as noted above, $\Delta(T,w) = R(T,w)/n$, we would like to estimate $R(T,w)$. More precisely, given $\delta > 0$ we would like to obtain an estimate $\widehat{R}$ such that $|\widehat{R} - R(T,w)| \leq \delta n$. To this end, we first establish two combinatorial claims. The first claim shows that the value of each $R^j_i(T,w)$ can be expressed in terms of the values of $N^{j'}_i(T,w)$ and $R^{j'-1}_{i-1}(T,w)$ for $j' \in [j]$.

Claim 2.1 For every $i \in \{2, \dots, k\}$ and $j \in [n]$,
$$R^j_i(T,w) = N^j_i(T,w) - \max_{j' \in [j]}\left\{ N^{j'}_i(T,w) - R^{j'-1}_{i-1}(T,w) \right\}.$$

Clearly, $R^j_i(T,w) \leq N^j_i(T,w)$ (for every $i \in \{2,\dots,k\}$ and $j \in [n]$), since each role-disjoint copy of $w_1 \dots w_i$ in $T[1,j]$ must end with a distinct occurrence of $w_i$ in $T[1,j]$. Claim 2.1 states by exactly how much $R^j_i(T,w)$ is smaller than $N^j_i(T,w)$. Roughly speaking, the max expression accounts for the number of occurrences of $w_i$ in $T[1,j]$ that cannot be used in role-disjoint copies of $w_1 \dots w_i$ in $T[1,j]$.
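Assuming the recursion $R^j_i = N^j_i - \max_{j' \in [j]}\{N^{j'}_i - R^{j'-1}_{i-1}\}$ (with the convention $R^0_{i-1}(T,w) = 0$), the values $R^j_i(T,w)$, and hence $R(T,w) = R^n_k(T,w)$, can be computed by a simple dynamic program. The following Python sketch is our own illustration of that recursion; it reads the entire text, so it is of course not the sublinear sample-based algorithm of this paper:

```python
def count_role_disjoint(text, word):
    """Compute R(T, w) via the recursion
        R_i^j = N_i^j - max_{j' <= j} (N_i^{j'} - R_{i-1}^{j'-1}),
    with R_1^j = N_1^j (the number of occurrences of word[0] in t_1..t_j)
    and R_{i-1}^0 = 0 for the empty prefix."""
    n = len(text)
    # R_prev[j] holds R_{i-1}^j; the base case i = 1 is a running count.
    R_prev = [0] * (n + 1)
    for j in range(1, n + 1):
        R_prev[j] = R_prev[j - 1] + (text[j - 1] == word[0])
    for i in range(1, len(word)):
        N = 0           # N_i^j as j advances
        best_gap = 0    # max_{j' <= j} (N_i^{j'} - R_{i-1}^{j'-1})
        R_cur = [0] * (n + 1)
        for j in range(1, n + 1):
            N += (text[j - 1] == word[i])
            best_gap = max(best_gap, N - R_prev[j - 1])
            R_cur[j] = N - best_gap
        R_prev = R_cur
    return R_prev[n]
```

For example, for $T = \texttt{abab}$ and $w = \texttt{ab}$ it returns 2, matching the two role-disjoint copies $(1,2)$ and $(3,4)$, and for $T = \texttt{aaa}$ and $w = \texttt{aa}$ it returns 2, matching the role-disjoint copies $(1,2)$ and $(2,3)$.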

Proof:
For simplicity (in terms of notation), we prove the claim for the case that $i = k$ and $j = n$. The proof for general $i \in \{2,\dots,k\}$ and $j \in [n]$ is essentially the same up to renaming of indices. Since $T$ and $w$ are fixed throughout the proof, we shall use the shorthand $N^j_i$ for $N^j_i(T,w)$ and $R^j_i$ for $R^j_i(T,w)$. For the sake of the analysis, we start by describing a simple greedy procedure that constructs $R = R^n_k$ role-disjoint copies of $w$ in $T$. The correctness of this procedure follows from [29, Claim 3.5] and a simple inductive argument (for details see Appendix B). Every copy $C_m$, for $m \in [R]$, is an array of size $k$ whose values are monotonically increasing. The procedure scans $T$ from left to right, and whenever it reads an index $j$ with $T[j] = w_i$ that can extend a partial copy consisting of already matched occurrences of $w_1 \dots w_{i-1}$, it does so; for $i > 1$ we say in such a case that the procedure matches $j$ to that partial copy (so that $j$ is the $i$-th index in the $m$-th greedy copy, i.e., $C_m[i] = j$).

Let $G^-_i$ denote the set of indices $j$ such that $T[j] = w_i$ and $j$ is not the $i$-th index of any greedy copy. It is easy to verify that $|G^-_i| = N^n_i - R$.

To complete the proof, we will show that $R = N^n_i - \max_{j' \in [n]}\{N^{j'}_i - R^{j'-1}_{i-1}\}$. Let $j^*$ be an index $j' \in [n]$ that maximizes $N^{j'}_i - R^{j'-1}_{i-1}$. In the interval $[j^*]$ we have $N^{j^*}_i$ occurrences of $w_i$, and in the interval $[j^*-1]$ we only have $R^{j^*-1}_{i-1}$ role-disjoint copies of $w_1 \dots w_{i-1}$. This implies that in the interval $[j^*]$ there are at least $N^{j^*}_i - R^{j^*-1}_{i-1}$ occurrences of $w_i$ that cannot be the $i$-th index of any greedy copy, and so $R \leq N^n_i - \max_{j' \in [n]}\{N^{j'}_i - R^{j'-1}_{i-1}\}$. On the other hand, denote by $j^{**}$ the largest index in $G^-_i$. Since $j^{**}$ was not matched by the greedy procedure, each of the $R^{j^{**}-1}_{i-1}$ role-disjoint copies of $w_1 \dots w_{i-1}$ in $[j^{**}-1]$ was already extended by a matched occurrence of $w_i$ preceding $j^{**}$ (as otherwise the index $j^{**}$ would have to be the $i$-th element of a greedy copy). Hence at least $R^{j^{**}-1}_{i-1}$ of the $N^{j^{**}}_i$ occurrences of $w_i$ in $[j^{**}]$ are matched, implying that $|G^-_i| \leq N^{j^{**}}_i - R^{j^{**}-1}_{i-1} \leq \max_{j' \in [n]}\{N^{j'}_i - R^{j'-1}_{i-1}\}$. In conclusion, $R = N^n_i - |G^-_i| \geq N^n_i - \max_{j' \in [n]}\{N^{j'}_i - R^{j'-1}_{i-1}\}$, and the claim follows.
In order to state our next combinatorial claim, we first introduce one more definition, which will play a central role in obtaining an estimate for $R(T,w)$.

Definition 2.2 For $\ell \leq n$, let $N$ be a $k \times \ell$ matrix of non-negative numbers, where we shall use $N^r_i$ to denote the entry in its $i$-th row and $r$-th column. For every $r \in [\ell]$, let $M^r_1(N) = N^r_1$, and for every $i \in \{2,\dots,k\}$, let
$$M^r_i(N) = N^r_i - \max_{b \in [r]}\left\{ N^b_i - M^{b-1}_{i-1}(N) \right\},$$
where $M^0_{i-1}(N) = 0$. When $i = k$ and $r = \ell$ we use the shorthand $M(N)$ for $M^\ell_k(N)$.
In our second combinatorial claim we show that for an appropriate choice of a matrix $N$, whose entries are a subset of the values $\{N^j_i(T,w)\}$, we can bound the difference between $M(N)$ and $R(T,w)$. We later use sampling to obtain an estimated version of $N$.
Claim 2.2 Let $J = \{j_0, j_1, \dots, j_\ell\}$ be a set of indices satisfying $j_0 = 0 < j_1 < \dots < j_\ell = n$ and $j_r - j_{r-1} \leq \gamma n$ for every $r \in [\ell]$, and let $N = N(J,T,w)$ be the matrix whose entries are $N^r_i = N^{j_r}_i(T,w)$. Then $|M(N) - R(T,w)| \leq k\gamma n$.

Proof: We shall prove that for every $i \in [k]$ and for every $r \in [\ell]$, $|M^r_i(N) - R^{j_r}_i(T,w)| \leq (i-1)\gamma n$. We prove this by induction on $i$. For $i = 1$ and every $r \in [\ell]$,
$$M^r_1(N) = N^r_1 = N^{j_r}_1(T,w) = R^{j_r}_1(T,w),$$
where the first equality follows from the setting of $N$ and the definitions of $M^r_1(N)$ and $R^{j_r}_1(T,w)$. For the induction step, we assume the claim holds for $i-1 \geq 1$ (and every $r \in [\ell]$) and prove it for $i$. By the setting of $N$ and the definition of $M^r_i(N)$,
$$M^r_i(N) = N^{j_r}_i(T,w) - \max_{b \in [r]}\left\{ N^{j_b}_i(T,w) - M^{b-1}_{i-1}(N) \right\},$$
while by Claim 2.1,
$$R^{j_r}_i(T,w) = N^{j_r}_i(T,w) - \max_{j \in [j_r]}\left\{ N^{j}_i(T,w) - R^{j-1}_{i-1}(T,w) \right\}.$$
Denote by $j^*$ an index $j \in [j_r]$ that maximizes the second max term, and let $b^*$ be the largest index such that $j_{b^*} \leq j^*$. We have
$$M^r_i(N) - R^{j_r}_i(T,w) \leq \left(N^{j^*}_i(T,w) - R^{j^*-1}_{i-1}(T,w)\right) - \left(N^{j_{b^*}}_i(T,w) - M^{b^*-1}_{i-1}(N)\right) \leq \gamma n + (i-2)\gamma n \leq (i-1)\gamma n,$$
where we used the fact that $N^{j^*}_i(T,w) - N^{j_{b^*}}_i(T,w) \leq j^* - j_{b^*} \leq \gamma n$, together with the induction hypothesis and the monotonicity of $R^{j}_{i-1}(T,w)$ in $j$. Similarly, let $b^{**}$ be the index $b \in [r]$ that maximizes the first max term. We have
$$R^{j_r}_i(T,w) - M^r_i(N) \leq \left(N^{j_{b^{**}}}_i(T,w) - M^{b^{**}-1}_{i-1}(N)\right) - \left(N^{j_{b^{**}}}_i(T,w) - R^{j_{b^{**}}-1}_{i-1}(T,w)\right) \leq \gamma n + (i-2)\gamma n \leq (i-1)\gamma n,$$
where we used the fact that $R^{j_{b^{**}}-1}_{i-1}(T,w) - R^{j_{b^{**}-1}}_{i-1}(T,w) \leq (j_{b^{**}} - 1) - j_{b^{**}-1} \leq \gamma n$, together with the induction hypothesis. Setting $i = k$ and $r = \ell$, the proof is completed.
In our next claim we bound the difference $|M(N) - M(\widetilde{N})|$ for any two matrices $N$ and $\widetilde{N}$ (with dimensions $k \times \ell$), given a bound on the $L_\infty$ distance between them. We later apply this claim with $N = N(J,T,w)$ as defined in Claim 2.2, and with $\widetilde{N}$ being a matrix that contains estimates $\widehat{N}^r_i$ of the values $N^{j_r}_i(T,w)$ (respectively). We discuss how to obtain $\widetilde{N}$ in Claim 2.4.
Claim 2.3 Let $\gamma \in (0,1)$, and let $N$ and $\widetilde{N}$ be two $k \times \ell$ matrices. If $|N^r_i - \widetilde{N}^r_i| \leq \gamma n$ for every $i \in [k]$ and $r \in [\ell]$, then $|M(N) - M(\widetilde{N})| \leq 2k\gamma n$.

Proof: We shall prove that for every $t \in [k]$ and for every $r \in [\ell]$, $|M^r_t(N) - M^r_t(\widetilde{N})| \leq (2t-1)\gamma n$. We prove this by induction on $t$. For $t = 1$ and every $r \in [\ell]$, we have $|M^r_1(N) - M^r_1(\widetilde{N})| = |N^r_1 - \widetilde{N}^r_1| \leq \gamma n$. Now assume the claim is true for $t-1 \geq 1$ and for every $r \in [\ell]$, and let us prove it for $t$. For any $r \in [\ell]$, by the definition of $M^r_t(\cdot)$,
$$\left|M^r_t(N) - M^r_t(\widetilde{N})\right| \leq \left|N^r_t - \widetilde{N}^r_t\right| + \left| \max_{b \in [r]}\left\{N^b_t - M^{b-1}_{t-1}(N)\right\} - \max_{b \in [r]}\left\{\widetilde{N}^b_t - M^{b-1}_{t-1}(\widetilde{N})\right\} \right| \leq \gamma n + \left| \max_{b \in [r]}\left\{N^b_t - M^{b-1}_{t-1}(N)\right\} - \max_{b \in [r]}\left\{\widetilde{N}^b_t - M^{b-1}_{t-1}(\widetilde{N})\right\} \right|,$$
where in the last inequality we used the premise of the claim. Assume that the first max term is at least as large as the second (the case that the second term is larger than the first is dealt with analogously), and let $r^*$ be the index that maximizes the first max term. Then
$$\max_{b \in [r]}\left\{N^b_t - M^{b-1}_{t-1}(N)\right\} - \max_{b \in [r]}\left\{\widetilde{N}^b_t - M^{b-1}_{t-1}(\widetilde{N})\right\} \leq \left(N^{r^*}_t - M^{r^*-1}_{t-1}(N)\right) - \left(\widetilde{N}^{r^*}_t - M^{r^*-1}_{t-1}(\widetilde{N})\right) \leq \gamma n + (2t-3)\gamma n,$$
where we used the premise of the claim once again, and the induction hypothesis. Combining the two bounds gives $|M^r_t(N) - M^r_t(\widetilde{N})| \leq \gamma n + \gamma n + (2t-3)\gamma n = (2t-1)\gamma n$, and the claim follows.

(Footnote 7, referenced in the proof of Claim 2.2: it actually holds that $M^r_i(N) \geq R^{j_r}_i(T,w)$, so that $R^{j_r}_i(T,w) - M^r_i(N) \leq 0$; but for the sake of simplicity of the inductive argument, we prove the same upper bound on both differences.)
The next claim states that we can obtain good estimates of all the values $N^{j_r}_i(T,w)$ (with a sufficiently large sample). Its (standard) proof is deferred to Appendix B.
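Such estimates can be obtained in the straightforward way: for each symbol $w_i$ and each grid point $j_r$, the fraction of sample points $(j, t_j)$ with $j \leq j_r$ and $t_j = w_i$ has expectation $N^{j_r}_i(T,w)/n$, and the additive Chernoff bound controls its deviation. A small Python sketch with our own (hypothetical) helper name:

```python
def estimate_prefix_counts(sample, word, grid):
    """Given sample points (j, t_j) with j uniform in [n], return estimates
    est[i][r] of N_i^{j_r}(T, w) / n -- the normalized number of
    occurrences of word[i] in the prefix t_1 .. t_{grid[r]}."""
    s = len(sample)
    return [[sum(1 for (j, t) in sample if j <= jr and t == wi) / s
             for jr in grid]
            for wi in word]
```

As a sanity check, when the "sample" happens to contain each index exactly once, the estimates coincide with the exact normalized counts.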

Distribution-free distance approximation
As noted in the introduction, our algorithm for approximating the distance from subsequence-freeness under a general distribution $p$ works by reducing the problem to approximating the distance from subsequence-freeness under the uniform distribution. However, we cannot use the algorithm presented in Section 2 as is; there are two main obstacles, explained next. In the reduction, given a word $w$ and access to samples from a text $T$, distributed according to $p$, we define a word $\widetilde{w}$ and a text $\widetilde{T}$ such that if we can obtain a good approximation of $\Delta(\widetilde{T}, \widetilde{w})$, then we get a good approximation of $\Delta(T,w,p)$. (Recall that $\Delta(T,w,p)$ denotes the distance of $T$ from being $w$-free under the distribution $p$.) However, first, we do not actually have direct access to uniformly distributed samples from $\widetilde{T}$, and second, we cannot work with a set $J$ of indices that induce equally sized intervals (of a bounded size), as we did in Section 2.
We address these challenges (as well as precisely define $\widetilde{T}$ and $\widetilde{w}$) in several stages. We start, in Sections 3.1 and 3.2, by using sampling according to $p$ in order to construct intervals in $T$ that have certain properties (with sufficiently high probability). The role of these intervals will become clear in the subsequent subsections.

Interval construction and classification
We begin this subsection by defining intervals in $[n]$ that are determined by $p$ (which is unknown to the algorithm). We then construct intervals by sampling from $p$, where the latter intervals are in a sense approximations of the former (this will be formalized subsequently). Each constructed interval will be classified as either "heavy" or "light", depending on its (approximated) weight according to $p$. Ideally, we would have liked all intervals to be light, but not too light, so that their number won't be too large (as was the case when we worked under the uniform distribution and simply defined intervals of equal size). However, for a general distribution $p$ we might have single indices $j \in [n]$ for which $p_j$ is large, and hence we also need to allow heavy intervals (each consisting of a single index). We shall make use of the following two definitions.

Definition 3.1 For any two integers $j_1 \leq j_2$, let $[j_1, j_2]$ denote the interval $\{j_1, \dots, j_2\}$. For every $j_1 \leq j_2$, define $\mathrm{wt}_p([j_1,j_2]) = \sum_{j=j_1}^{j_2} p_j$ to be the weight of the interval $[j_1,j_2]$ according to $p$. We shall use the shorthand $\mathrm{wt}_p(j)$ for $\mathrm{wt}_p([j,j])$.

Definition 3.2 Let $S$ be a multiset of size $s$, with elements from $[n]$. For every $j \in [n]$, let $N_S(j)$ be the number of elements in $S$ that equal $j$. For every $j_1, j_2 \in [n]$, define $\mathrm{wt}_S([j_1,j_2]) = \frac{1}{s}\sum_{j=j_1}^{j_2} N_S(j)$ to be the estimated weight of the interval $[j_1,j_2]$ according to $S$. We shall use the shorthand $\mathrm{wt}_S(j)$ for $\mathrm{wt}_S([j,j])$.
In the next definition, and in the remainder of this section, we shall use a parameter $z$ (set in Equation (3.1)), where $c_z = 100$. We next define the aforementioned set of intervals, based on $p$. Roughly speaking, we try to make the intervals as equally weighted as possible, keeping in mind that some indices might have a large weight, so we assign each such index to an interval of its own.

Definition 3.3 Define a sequence of indices in the following iterative manner. Let $h_0 = 0$ and for $\ell = 1, 2, \dots$, as long as $h_{\ell-1} < n$, let $h_\ell$ be the next breakpoint, so that each resulting interval $H_\ell = [h_{\ell-1}+1, h_\ell]$ is either a single index of large weight, an interval of medium weight, or an interval of small weight; we denote the corresponding subsets of intervals by $H_{\mathrm{sin}}$, $H_{\mathrm{med}}$ and $H_{\mathrm{sml}}$, respectively.

Observe that since $\mathrm{wt}_p(T) = 1$, we have $|H_{\mathrm{sin}} \cup H_{\mathrm{med}}| \leq 8z$. In addition, since between each $H', H'' \in H_{\mathrm{sml}}$ there has to be at least one $H \in H_{\mathrm{sin}}$, we also have $|H_{\mathrm{sml}}| \leq |H_{\mathrm{sin}}| + 1$. By its definition, $H$ is determined by $p$. We next construct a set of intervals $B$ based on sampling according to $p$ (in a similar, but not identical, fashion to Definition 3.3). Consider a sample $S_1$ of size $s_1$ selected according to $p$ (with repetitions), where $s_1$ will be set subsequently.
Observe that each heavy interval consists of a single element.
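One plausible way to realize such a construction from (estimated) index weights is sketched below, in Python, with our own names and a single illustrative threshold $1/z$; the actual thresholds and classification rules of Definitions 3.3-3.4 differ. An interval is closed once its accumulated weight reaches the threshold, and an index whose own weight already exceeds the threshold becomes a heavy singleton interval:

```python
def build_intervals(weights, z):
    """Partition [1, n] into consecutive intervals: 'light' intervals whose
    accumulated weight first reaches 1/z (weights[j-1] plays the role of
    the weight of index j), plus 'heavy' single-index intervals for indices
    whose own weight is at least 1/z.  Returns (start, end, is_heavy)."""
    n = len(weights)
    intervals, start, acc = [], 1, 0.0
    for j in range(1, n + 1):
        wj = weights[j - 1]
        if wj >= 1.0 / z:                 # heavy index: interval of its own
            if j > start:                 # close the pending light interval
                intervals.append((start, j - 1, False))
            intervals.append((j, j, True))
            start, acc = j + 1, 0.0
        else:
            acc += wj
            if acc >= 1.0 / z:            # light interval reached threshold
                intervals.append((start, j, False))
                start, acc = j + 1, 0.0
    if start <= n:                        # trailing light interval, if any
        intervals.append((start, n, False))
    return intervals
```

Note that a light interval cut short by a following heavy index may end up with weight below the threshold, which loosely mirrors the distinction between the medium- and small-weight intervals in Definition 3.3.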
In order to relate between H and B, we introduce the following event, based on the sample S 1 .

Definition 3.5 Denote by $E_1$ the event (over the choice of the sample $S_1$) that the estimated weights $\mathrm{wt}_{S_1}$ of all the relevant intervals are close to their true weights according to $p$.
and the claim follows.

Estimation of symbol density and weight of intervals
In this subsection we estimate the weight, according to $p$, of every interval $[b_u]$ for $u \in [U]$, as well as its symbol density, focusing on symbols that occur in $w$. Note that $[b_u]$ is the union of the intervals $B_1, \dots, B_u$. We first introduce some notation.
For any word w * , text and for every u where the probability is over the choice of S 2 .

Proof:
Using the additive Chernoff bound (see Theorem A.1), along with the fact that $\mathrm{E}[N_{S_2}(j)] = s_2 \cdot p_j$, and applying a union bound over all $i \in [k]$ and $u \in [U]$, we get that with probability at least $19/20$, $|\widehat{\xi}^u_i - \xi^u_i| \leq 1/z$ for all $i$ and $u$. Another use of the additive Chernoff bound, together with the corresponding expectation bound and a union bound over all $u \in [U]$, gives us that with probability at least $19/20$, the weight estimates of the intervals $[b_u]$ are also sufficiently accurate. One last use of the union bound gives us that $\Pr[E_2] \geq 9/10$.
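The sample sizes in this section follow the standard additive Chernoff (Hoeffding) form: to estimate a $[0,1]$-bounded mean within additive error $\epsilon$ with failure probability $\delta_0$, about $\ln(2/\delta_0)/(2\epsilon^2)$ samples suffice, and a union bound over all estimated quantities divides the failure probability. A small illustrative calculation (our own helpers, not the paper's exact constants):

```python
import math

def chernoff_sample_size(eps, fail_prob):
    """Smallest integer s with 2*exp(-2*s*eps**2) <= fail_prob, so that an
    empirical average of s i.i.d. [0,1]-valued variables deviates from its
    mean by more than eps with probability at most fail_prob."""
    return math.ceil(math.log(2.0 / fail_prob) / (2.0 * eps ** 2))

def simultaneous_sample_size(eps, fail_prob, num_estimates):
    """Sample size making all num_estimates empirical averages eps-accurate
    simultaneously, via a union bound over the individual failure events."""
    return chernoff_sample_size(eps, fail_prob / num_estimates)
```

Driving the failure probability down for many simultaneous estimates only costs a logarithmic factor in the sample size, which is consistent with the logarithmic factors appearing in the sample sizes in this section.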

Reducing from distribution-free to uniform
In this subsection we give the details of the aforementioned reduction from the distribution-free case to the uniform case, using the intervals and estimators that were defined in the previous subsections. We start by providing three definitions, taken from [29], which will be used in the reduction. The first two definitions concern the notion of splitting (variants of this notion were also used in previous works, e.g., [14]).

Definition 3.7
For a text $T = t_1 \dots t_n$, a text $\widetilde{T}$ is said to be a splitting of $T$ if $\widetilde{T} = t_1^{\alpha_1} \dots t_n^{\alpha_n}$ for some $\alpha_1, \dots, \alpha_n \in \mathbb{N}^+$. We denote by $\varphi$ the splitting map, which maps each (index of a) symbol of $\widetilde{T}$ to its origin in $T$. Formally, $\varphi : [\widetilde{n}] \to [n]$ for $\widetilde{n} = \sum_{j=1}^{n} \alpha_j$, where $\varphi(x) = j$ for every $x \in \left[\sum_{j' < j} \alpha_{j'} + 1,\ \sum_{j' \leq j} \alpha_{j'}\right]$. Note that by this definition, $\varphi$ is a non-decreasing surjective map, satisfying $\widetilde{t}_x = t_{\varphi(x)}$ for every $x \in [\widetilde{n}]$. The third definition is of the set of words in which no two consecutive symbols are the same.
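The splitting just defined is easy to realize programmatically; the following sketch (our own helper) builds the split text and the map $\varphi$ explicitly:

```python
def split_text(text, alphas):
    """Return the splitting of `text` in which the j-th symbol is repeated
    alphas[j-1] >= 1 times, together with the splitting map phi, where
    phi[x-1] = j means that position x of the splitting originates in t_j."""
    assert len(alphas) == len(text) and all(a >= 1 for a in alphas)
    symbols, phi = [], []
    for j, (t, a) in enumerate(zip(text, alphas), start=1):
        symbols.extend([t] * a)   # a copies of the symbol t_j
        phi.extend([j] * a)       # all of them map back to index j
    return "".join(symbols), phi
```

For example, splitting $T = \texttt{ab}$ with $\alpha = (2,3)$ yields $\widetilde{T} = \texttt{aabbb}$ and the non-decreasing surjective map $\varphi = (1,1,2,2,2)$, with $\widetilde{t}_x = t_{\varphi(x)}$ for every $x$.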

A basis for reducing from distribution-free to uniform
Let $w$ be a word of length $k$ and $T$ a text of length $n$. In this subsection we establish a claim that gives sufficient conditions on a (normalized version of an) estimation matrix $\widehat{N}$, under which it can be used to obtain an estimate of $\Delta(\widetilde{T}, \widetilde{w})$ with a small additive error. We first state a claim that is similar to Claim 2.2, with a small but important difference that takes into account intervals in $\widetilde{T}$ (determined by a set of indices $J$) that consist of repetitions of a single symbol. Since its proof is very similar to the proof of Claim 2.2, it is deferred to Appendix B. Recall that $M(\cdot)$ was defined in Definition 2.2, and that $R(\widetilde{T}, \widetilde{w})$ denotes the number of role-disjoint copies of $\widetilde{w}$ in $\widetilde{T}$.

Claim 3.4 Let $J = \{j_0, j_1, \dots, j_\ell\}$ be a set of indices satisfying $j_0 = 0 < j_1 < j_2 < \dots < j_\ell = \widetilde{n}$, and let $N$ be the matrix whose entries are $N^r_i = N^{j_r}_i(\widetilde{T}, \widetilde{w})$.

The following observation can be easily proved by induction.

Observation 3.5 Let $N$ be a matrix of size $k \times \ell$ and let $c > 0$. Then $M(c \cdot N) = c \cdot M(N)$.

The next claim will serve as the basis for our reduction from the general, distribution-free case to the uniform case.

Claim 3.6 Let $N$ be a $k \times \ell$ matrix, let $J = \{j_0, j_1, j_2, \dots, j_\ell\}$ be a set of indices satisfying $j_0 = 0 < j_1 < j_2 < \dots < j_\ell = \widetilde{n}$, and let $c_1$ and $c_2$ be constants. Suppose that the following conditions are satisfied.

For every
Then $\left| M(N)/\widetilde{n} - \Delta(\widetilde{T}, \widetilde{w}) \right| \leq (c_1 + 2c_2)\delta$.

Proof: Let $N'$ be the matrix whose entries are $(N')^r_i = N^{j_r}_i(\widetilde{T}, \widetilde{w})$ for every $i \in [k]$ and $r \in [\ell]$. We use Claim 3.4 and Item 1 in the premise of the current claim to obtain that $|M(N') - R(\widetilde{T}, \widetilde{w})| \leq c_1 \delta \widetilde{n}$. We also use Claim 2.3 and Item 2 in the premise of the current claim to obtain that $|M(N) - M(N')| \leq 2c_2 \delta \widetilde{n}$. The claim follows by applying Observation 3.5, along with the fact that $R(\widetilde{T}, \widetilde{w})/\widetilde{n} = \Delta(\widetilde{T}, \widetilde{w})$.

Establishing the reduction for w ∈ W c and quantized p
For ease of readability, we begin by addressing the special case in which $w \in W_c$ (recall Definition 3.9) and where there exists $\beta \in (0,1)$ such that $p_j/\beta$ is an integer for every $j \in [n]$. We later show how to deal with the general case, where we rely on techniques from [29] and introduce some new ones that are needed for implementing our algorithm.
For the case considered in this subsection, let $\alpha_j = p_j/\beta$ for every $j \in [n]$, so that $\widetilde{p}$ is the uniform distribution. Since $p_j = \beta \cdot \alpha_j$ for every $j \in [n]$, we get that $(\widetilde{T}, \widetilde{p})$ is a splitting of $(T, p)$ (recall Definition 3.8), and hence by [29, Clm. 4.4] (using the assumption that $w \in W_c$), $\Delta(\widetilde{T}, w, \widetilde{p}) = \Delta(T, w, p)$.
We next introduce a notation for the weights, according to p, of unions of these intervals.For every Note that where ξ u i is as defined in Equation (3.11). Proof: I j i ( T , w) p j = j∈ bu and the claim is established.
We can now state and prove the following lemma. As in the uniform case, the running time of the algorithm is linear in the size of the sample.
Proof: The algorithm first takes a sample $S_1$ of size $s_1 = 120z\log(240z)$ and constructs a set of intervals $B$ as defined in Definition 3.4. Next, the algorithm takes another sample, $S_2$, of size $s_2 = z^2\log(40kU)$, according to which it defines an estimation matrix $\widehat{\xi}$ of size $k \times U$ as follows: for every $i \in [k]$ and $u \in [U]$, its $(i,u)$ entry is $\widehat{\xi}^u_i$, as defined in Equation (3.12). Lastly, the algorithm outputs $\widehat{\Delta} = M(\widehat{\xi})$, where $M(\cdot)$ is as defined in Definition 2.2.
We would like to apply Claim 3.6 in order to show that $|\widehat{\Delta} - \Delta(\bar{T}, \bar{w})| \le \delta$ with probability at least $2/3$. By the setting of $s_1$, applying Claim 3.1 gives us that with probability at least $8/10$, the event $E_1$, as defined in Definition 3.5, holds. By the setting of $s_2$, applying Claim 3.3 gives us that with probability at least $9/10$, the event $E_2$, as defined in Definition 3.6, holds. We henceforth condition on both events (they hold together with probability at least $7/10$).
In order to apply Claim 3.6, we set $\bar{w} = w$, $J = \{\bar{b}_0, \bar{b}_1, \dots, \bar{b}_U\}$ (recall Definition 3.10) and $N = \bar{n} \widehat{\xi}$, for $\widehat{\xi}$ as defined above. Also, we set $c_1 = 1/2$ and $c_2 = 1/4$. We next show that both items in the premise of the claim are satisfied.
To show that Item 1 is satisfied, we first note that since $\bar{p}$ is uniform, for every $u \in [U]$ we have $\mathrm{wt}_{\bar{p}}(\bar{b}_u) = \frac{\bar{b}_u - \bar{b}_{u-1}}{\bar{n}}$. We use the consequence of Claim 3.2 (recall that we condition on $E_1$), by which for every $u$ such that $\frac{\bar{b}_u - \bar{b}_{u-1}}{\bar{n}} \ge \frac{6}{z}$, the interval $\bar{B}_u$ is heavy (since for every $u \in [U]$, $\mathrm{wt}_{\bar{p}}(\bar{B}_u) = \mathrm{wt}_p(B_u)$). By Definition 3.4 this implies that $B_u$ contains only one index. By the definition of $z$ (Equation (3.1)) and the setting of $c_1$, the item is satisfied.
To show that Item 2 is satisfied, we use the definition of $E_2$ (Definition 3.6, Equation (3.13)) together with Claim 3.7. By Equation (3.22), the definition of $z$, and the setting of $c_2$, we get that the item is satisfied.
In the next subsections we turn to the general case, where we do not necessarily have that $w \in W_c$, or that there exists a value $\beta$ such that $p_j/\beta$ is an integer for every $j \in [n]$. In this general case we need to take some extra steps before we can perform a splitting: we first perform a reduction to a slightly different (quantized) distribution, and then perform a reduction to the case $w \in W_c$. While this follows [29], for the sake of our algorithm we need to show along the way how to define the estimation matrix $\widehat{\xi}$ ($= N/\bar{n}$) and the corresponding set of indices $J$, so that we can apply Claim 3.6, similarly to what was shown in the proof of Lemma 3.8.

Quantized distribution
Let $\eta = c_\eta \cdot \frac{1}{nz}$, where $c_\eta = \frac{1}{16}$. We define $\dot{p}$ by "rounding" $p_j$, for every $j \in [n]$, up to its nearest larger integer multiple of $\eta$. Namely, $\dot{p}_j = \eta \cdot \lceil p_j/\eta \rceil$.

Claim 3.9 For every $i \in [k]$ and $u \in [U]$, and for every $u$

Proof: Equation (3.28) follows by using the triangle inequality, along with the fact that for every
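The rounding step can be written out directly. The following is a minimal sketch; the helper name is ours, and the decision not to renormalize is an assumption (the excerpt only states the rounding rule, and the surrounding analysis accounts for the rounding error).

```python
import math

def quantize_up(p, eta):
    # Round each p_j up to its nearest larger integer multiple of eta,
    # i.e. p_dot_j = eta * ceil(p_j / eta). The result may sum to slightly
    # more than 1; it is treated as a quantized weight vector.
    return [eta * math.ceil(pj / eta) for pj in p]
```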

Dealing with $w \notin W_c$
We would have liked to consider a $\dot{p}$-splitting of $T = t_1 t_2 \dots t_n$ and then use the relationship between the distance from $w$-freeness before and after the splitting. However, we only know this connection between the distances in the case of $w \in W_c$. Hence, we shall apply a reduction from a general $w$ to a word in $W_c$, as was done in [29] in their proof of Lemma 4.8. Without loss of generality, assume $0$ is a symbol that does not appear in $w$ or $T$ (if there is no such symbol in $\Sigma$, then we extend $\Sigma$ to $\Sigma \cup \{0\}$). Let $w' = w_1 0 w_2 0 \cdots w_{k-1} 0 w_k 0$, $T' = t_1 0 t_2 0 \cdots t_n 0$, and $p' = (\dot{p}_1/2, \dot{p}_1/2, \dots, \dot{p}_n/2, \dot{p}_n/2)$.
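This reduction is purely syntactic and easy to sketch; the function name is ours, and `"0"` stands for the fresh symbol assumed above.

```python
def pad_with_zero(w, T, p_dot, zero="0"):
    # Build w' = w_1 0 w_2 0 ... w_k 0, T' = t_1 0 t_2 0 ... t_n 0, and
    # p' = (p_1/2, p_1/2, ..., p_n/2, p_n/2); assumes `zero` occurs in
    # neither w nor T.
    w_prime = [s for wi in w for s in (wi, zero)]
    T_prime = [s for ti in T for s in (ti, zero)]
    p_prime = [half for pj in p_dot for half in (pj / 2, pj / 2)]
    return w_prime, T_prime, p_prime
```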
Note that $w'$ is in $W_c$, and we apply [29, Clm. 4.6]. Here too we define a set of intervals of $[2n]$.
Algorithm 1 constructs these intervals and sets $U' = \max f^{-1}(U)$. Intuitively, Algorithm 1 makes sure that $0$'s that come after a heavy interval in $T$ become a single-index interval themselves in $T'$. The rest of the $0$'s are joined to the light interval that includes their left neighbour in $T$, forming a new interval that has the same weight as the light interval in $T$. The algorithm also sets the function $f$, which maps intervals in $T'$ to their corresponding intervals in $T$.

Estimators for the distribution-free case
For every $i \in [2k]$ and $u \in [U']$, let $x(u, i)$ take the following values. Define the following estimator, for every $i \in [2k]$ and $u \in [U']$. For the next claim, recall that the event $E_2$ was defined in Definition 3.6.
Using Equation (3.25), and since we conditioned on $E_2$, we get the desired inequality.
We prove another claim to establish a connection between $\Delta(\bar{T}, w', \bar{p})$ and $\Delta(T, w, p)$.

Wrapping things up in the general case
We can now restate and prove the main theorem of this section (as it appeared in the introduction). Note that if $\delta \ge 1/k_d$, then the algorithm can simply output $0$. This is true since the number of role-disjoint copies of $w$ in $T$ is at most the number of occurrences in $T$ of the symbol of $w$ that is least frequent in $T$. This number is upper bounded by $\frac{n}{k_d}$, and so the distance from $w$-freeness is at most $\frac{1}{k_d} \le \delta$. In this case no sampling is needed, so only the trivial lower bound holds. The proof will deal with the case of $\delta \in \left(0, \frac{1}{300 k_d}\right]$.

Proof: The proof is based on the difficulty of distinguishing between an unbiased coin and a coin with a small bias. Precise details follow.
Let $V = \{v_1, \dots, v_{k_d}\}$ be the set of distinct symbols in $w$, and let $0$ be a symbol that does not belong to $V$. We define two distributions over texts, $\mathcal{T}_1$ and $\mathcal{T}_2$, as follows. For each $\tau \in [n/k_d]$ and $\rho \in [0, 1]$, let $\lambda_\rho^\tau$ be a random variable that equals $0$ with probability $\rho$ and equals $v_1$ with probability $1 - \rho$. Let $\delta' = 3 k_d \delta$ and consider the following two distributions over texts. Namely, the supports of both distributions contain texts that consist of $n/k_d$ blocks of size $k_d$ each.
For $i \in \{2, \dots, k_d\}$, the $i$-th symbol in each block is $v_i$. The distributions differ only in the way the first symbol in each block is selected. In $\mathcal{T}_1$ it is $0$ with probability $1/2$ and $v_1$ with probability $1/2$, while in $\mathcal{T}_2$ it is $0$ with probability $1/2 + \delta' = 1/2 + 3\delta k_d$ and $v_1$ with probability $1/2 - \delta'$.

Assume, contrary to the claim, that we have a sample-based distance-approximation algorithm for subsequence-freeness that takes a sample of size $Q(k_d, \delta) = 1/(c k_d \delta^2)$, for some sufficiently large constant $c$, and outputs an estimate of the distance to $w$-freeness that has additive error at most $\delta$, with probability at least $2/3$. Consider running the algorithm on either $T_1 \sim \mathcal{T}_1$ or $T_2 \sim \mathcal{T}_2$. Let $L$ denote the number of times that the sample lands on an index of the form $j = \ell \cdot k_d + 1$ for an integer $\ell$. Since only a $1/k_d$ fraction of the indices have this form, by Markov's inequality the probability that $L > 10\, Q(k_d, \delta)/k_d$ is at most $1/10$. (In both cases the probability is taken over the selection of $T_b \sim \mathcal{T}_b$, the sample that the algorithm gets, and possibly additional internal randomness of the algorithm.) Based on the definitions of $\mathcal{T}_1$ and $\mathcal{T}_2$, this implies that it is possible to distinguish between an unbiased coin and a coin with bias $3 k_d \delta$ with probability at least $2/3 - 1/100 - 1/10 > 8/15$, using a sample of size $\frac{1}{c' k_d^2 \delta^2}$, in contradiction to the result of Bar-Yossef [2, Thm. 8] (applied with $m = 2$ and $\epsilon = 3 k_d \delta$; since $\delta < \frac{1}{300 k_d}$, we have $\epsilon < \frac{1}{96}$, as the cited theorem requires).
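To make the construction concrete, the following sketch samples a text from $\mathcal{T}_1$ or $\mathcal{T}_2$; the function and parameter names are ours, and `symbols[0]` plays the role of $v_1$.

```python
import random

def sample_text(n, k_d, symbols, delta, biased, rng=random):
    # Sample from T_1 (biased=False) or T_2 (biased=True): n/k_d blocks of
    # length k_d; positions 2..k_d of every block hold v_2..v_{k_d}, and the
    # first position is "0" with probability 1/2 (T_1) or 1/2 + 3*k_d*delta
    # (T_2), and v_1 otherwise.
    rho = 0.5 + (3 * k_d * delta if biased else 0.0)
    text = []
    for _ in range(n // k_d):
        text.append("0" if rng.random() < rho else symbols[0])
        text.extend(symbols[1:k_d])
    return text
```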
A sequence $(C_1, \dots, C_m)$ of role-disjoint copies of $w$ in $T$ is a sequence of ordered role-disjoint copies if for every $r \in [m-1]$ we have that $C_{r+1}$ succeeds $C_r$. By [29, Claim 3.5], for every set of role-disjoint copies of $w$ in $T$, there exists a sequence of ordered role-disjoint copies of $w$ in $T$ of the same size. Since the greedy algorithm described in the proof of Claim 2.1 finds a sequence of ordered role-disjoint copies of $w$ in $T$, it remains to show that there is no longer (larger) sequence of ordered role-disjoint copies of $w$ in $T$.
Denote by $C = (C_1, \dots, C_{|C|})$ the sequence of ordered role-disjoint copies of $w$ in $T$ that is found by the greedy algorithm. Assume, contrary to the claim, that there is a longer sequence $\widetilde{C} = (\widetilde{C}_1, \dots, \widetilde{C}_{|\widetilde{C}|})$ of ordered role-disjoint copies of $w$ in $T$. In what follows we show, by induction on $m$ and $i$, that $C_m[i] \le \widetilde{C}_m[i]$, which will imply a contradiction to the counter assumption.
For every $m \in [|C|]$ and for $i = 1$, by the definition of the greedy algorithm, $C_m[1]$ is the index of the $m$th occurrence of $w_1$ in $T$. In order to prove the claim for $(m, i)$ where $i > 1$, we assume by induction that it holds for $(m, i-1)$ and for $(m-1, i)$, where for the sake of the argument (so that $C_{m-1}$ and $\widetilde{C}_{m-1}$ are defined also for $m = 1$) we define $C_0[i] = \widetilde{C}_0[i] = -k + i$ for every $i \in [k]$.

Proof of Claim 2.4: Let $s = \frac{\log(6 k \ell)}{2 \gamma^2}$. We take $s$ samples from $[n]$, selected uniformly and independently at random (allowing repetitions). Denote the $q$-th sampled index by $\rho_q$. For every $i \in [k]$, $r \in [\ell]$ and $q \in [s]$, define the random variable $\chi_q^{i,r}$ to equal $1$ if and only if $\rho_q \in [j_r]$ and $T[\rho_q] = w_i$; otherwise $\chi_q^{i,r} = 0$. For every $i \in [k]$ and $r \in [\ell]$, set $\widehat{N}_i^r = \frac{n}{s} \sum_{q \in [s]} \chi_q^{i,r}$.

Proof of Claim 3.4: For the sake of simplicity, we use $T$ and $w$ instead of $\bar{T}$ and $\bar{w}$, respectively. Recall that $M(N) = M_k^\ell(N)$ and $R(T, w) = R_{j_\ell}^k(T, w)$. We shall prove that for every $i \in [k]$ and every $r \in [\ell]$, $\left| M_i^r(N) - R_{j_r}^i(T, w) \right| \le (i-1) \cdot \max_{\tau \in [r] \setminus J'} \{ j_\tau - j_{\tau-1} \}$. We prove this by induction on $i$. For $i = 1$ we have $M_1^r(N) = N_1^r = R_{j_r}^1(T, w)$, where the first equality follows from the setting of $N$ and the definitions of $M_1^r(N)$ and $R_{j_r}^1(T, w)$. For the induction step, we assume the claim holds for $i - 1 \ge 1$ (and every $r \in [\ell]$) and prove it for $i$; applying the induction hypothesis yields the claimed bound, and the proof is completed. (It actually holds that $M_i^r(N) \ge R_{j_r}^i(T, w)$, so that $R_{j_r}^i(T, w) - M_i^r(N) \le 0$, but for the sake of simplicity of the inductive argument, we prove the same upper bound on $R_{j_r}^i(T, w) - M_i^r(N)$ as on $M_i^r(N) - R_{j_r}^i(T, w)$.)
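The greedy algorithm analyzed above (place the $i$-th index of the $m$-th copy at the first occurrence of $w_i$ after both $C_m[i-1]$ and $C_{m-1}[i]$) can be sketched directly. The $O(nk)$ rendering below is our illustration, not the paper's implementation.

```python
def role_disjoint_copies(T, w):
    # Greedily count role-disjoint copies of w in T, building the copies in
    # order: the i-th index of the current copy is the first occurrence of
    # w[i] after both the (i-1)-th index of this copy and the i-th index of
    # the previous copy.
    k = len(w)
    if k == 0:
        return 0  # convention: the empty word contributes no copies
    prev = [-1] * k          # indices used by the previously completed copy
    copies = 0
    while True:
        cur = [-1] * k
        pos = -1             # previously placed index within this copy
        for i in range(k):
            j = max(pos, prev[i]) + 1
            while j < len(T) and T[j] != w[i]:
                j += 1
            if j == len(T):  # no place for role i: no further copy exists
                return copies
            cur[i] = j
            pos = j
        prev = cur
        copies += 1
```

For example, `role_disjoint_copies("aabb", "ab")` is `2`, using the index sets $\{0, 2\}$ and $\{1, 3\}$.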

Theorem 1.1
There exists a sample-based distance-approximation algorithm for subsequence-freeness under the uniform distribution that takes a sample of size $\Theta\left(\frac{k^2}{\delta^2} \cdot \log\frac{k}{\delta}\right)$ and outputs an estimate $\widehat{\Delta}$ such that $|\widehat{\Delta} - \Delta(T, w)| \le \delta$ with probability at least $2/3$. While our focus is on the sample complexity of the algorithm, we note that its running time is linear in the size of the sample.

Proof: The algorithm sets $\gamma = \delta/(3k)$ and $J = \{\gamma n, 2\gamma n, \dots, n\}$. It first applies Claim 2.4 with the above setting of $\gamma$ to obtain the estimates $\widehat{N}_i^r$ for every $i \in [k]$ and $r \in [\ell]$, which with probability at least $2/3$ are as stated in Equation (2.17). If we take $N = \widehat{N}$, for $N$ as defined in Claim 2.2, then the premise of Claim 2.3 holds. We can hence apply Claim 2.3, and combining with Claim 2.2 and the definition of $J$, we get that with probability at least $2/3$, $\left| M(\widehat{N}) - R(T, w) \right| \le (2k-1)\gamma n + (k-1)\gamma n = (3k-2)\gamma n \le \delta n$.
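The estimator from the proof of Claim 2.4 can be sketched as follows; the $\frac{n}{s}$ rescaling of the empirical counts is our reading of the estimator, since the displayed formula is not reproduced here.

```python
import random

def estimate_counts(T, w, J, s, rng=random):
    # Estimate N^i_{j_r}(T, w) = #{j <= j_r : t_j = w_i} for each i in [k]
    # and each cutoff j_r in J, from s uniform samples with repetition.
    n, k = len(T), len(w)
    samples = [rng.randrange(n) for _ in range(s)]  # 0-based positions
    N_hat = [[0.0] * len(J) for _ in range(k)]
    for i in range(k):
        for r, jr in enumerate(J):
            hits = sum(1 for q in samples if q < jr and T[q] == w[i])
            N_hat[i][r] = n * hits / s  # rescale the empirical frequency
    return N_hat
```

By an additive Chernoff bound, each entry is within $\gamma n$ of the true count with high probability once $s = \Omega(\log(k\ell)/\gamma^2)$, matching the setting of $s$ in the proof of Claim 2.4.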

Definition 3.4
Given a sample $S_1$ (a multiset of elements of $[n]$) of size $s_1$, determine a sequence of indices in the following iterative manner. Let $b_0 = 0$, and for $u = 1, 2, \dots$, as long as $b_{u-1} < n$, let $b_u$ be defined as follows. If $\mathrm{wt}_{S_1}(b$
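The iterative construction can be sketched as follows. The stopping rule is truncated in the text above, so the specific threshold test used here (give empirically heavy indices singleton intervals, and otherwise extend the current interval while its empirical weight stays below the threshold) is an assumption.

```python
def build_intervals(sample, n, threshold):
    # Sketch of the interval construction: `sample` is a multiset of
    # 1-based indices; intervals are returned as 1-based (lo, hi) pairs.
    def wt(lo, hi):
        # empirical weight of the interval (lo, hi] under the sample
        return sum(1 for x in sample if lo < x <= hi) / len(sample)
    B, b = [], 0
    while b < n:
        nxt = b + 1
        if wt(b, nxt) <= threshold:
            # light start: grow the interval while its weight allows
            while nxt < n and wt(b, nxt + 1) <= threshold:
                nxt += 1
        # a heavy first index falls through as a singleton interval
        B.append((b + 1, nxt))
        b = nxt
    return B
```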

Lemma 3.8
Let $w$ be a word of length $k$ in $W_c$, $T$ a text of length $n$, and $p$ a distribution over $[n]$ for which there exists $\beta \in (0, 1)$ such that $p_j/\beta$ is an integer for every $j \in [n]$. There exists an algorithm that, given a parameter $\delta \in (0, 1)$, takes a sample of size $\Theta\left(\frac{k^2}{\delta^2} \cdot \log\frac{k}{\delta}\right)$ from $T$, distributed according to $p$, and outputs an estimate $\widehat{\Delta}$ such that $|\widehat{\Delta} - \Delta(T, w, p)| \le \delta$ with probability at least $2/3$.

Claim 3.13
Conditioned on the event $E_2$, for every $i \in [2k]$ and $u \in [U']$, Equation (3.40) holds. Proof: Using the triangle inequality, along with Claim 3.9, Observation 3.10 and Claim 3.11, we get the claimed bound for every $i \in [2k]$ and $u \in [U']$.

Theorem 1.2
There exists a sample-based distribution-free distance-approximation algorithm for subsequence-freeness that takes a sample of size $\Theta\left(\frac{k^2}{\delta^2} \cdot \log\frac{k}{\delta}\right)$ from $T$, distributed according to an unknown distribution $p$, and outputs an estimate $\widehat{\Delta}$ such that $|\widehat{\Delta} - \Delta(T, w, p)| \le \delta$ with probability at least $2/3$. The lower bound is conditioned on $\delta \le \frac{1}{300 k_d}$ and $n > \max\left\{\frac{8k}{\delta}, \frac{200 k_d}{\delta^2}\right\}$.

$v_1$ with probability $1/2 - \delta'$. For $b \in \{1, 2\}$, consider selecting a text $T_b$ according to $\mathcal{T}_b$ (denoted by $T_b \sim \mathcal{T}_b$), and let $O_b$ be the number of occurrences of $v_1$ in the text (so that $O_b$ is a random variable). Observe that $\mathrm{E}[O_1] = \frac{n}{2 k_d}$ and $\mathrm{E}[O_2] = \frac{n}{2 k_d} - 3\delta n$. By applying the additive Chernoff bound (Theorem A.1) and using the premise of the theorem regarding $n$, we get that $O_1$ and $O_2$ are concentrated around their expectations.

Here $\widetilde{C}_0[i] = C_0[i] = -k + i$. By the induction hypothesis, $C_m[i-1] \le \widetilde{C}_m[i-1]$ and $C_{m-1}[i] \le \widetilde{C}_{m-1}[i]$. Because the indices of a copy are always strictly increasing, $\widetilde{C}_m[i-1] < \widetilde{C}_m[i]$, and since $\widetilde{C}$ is ordered, $\widetilde{C}_{m-1}[i] < \widetilde{C}_m[i]$. Therefore, $C_m[i-1] < \widetilde{C}_m[i]$ and $C_{m-1}[i] < \widetilde{C}_m[i]$. By the definition of the algorithm, $C_m[i]$ is the index of the first occurrence of $w_i$ following $C_m[i-1]$ that is larger than $C_{m-1}[i]$. Since $T[\widetilde{C}_m[i]] = w_i$, we get that $C_m[i] \le \widetilde{C}_m[i]$, as claimed. Finally, by the counter assumption, $|\widetilde{C}| > |C|$. By what we have shown above, this implies that $C_m[i] < \widetilde{C}_{|C|+1}[i]$ for every $m \in [|C|]$ and $i \in [k]$. But this contradicts the fact that the algorithm did not find any role-disjoint copy after $C_{|C|}$.