Improved Bounds for Codes Correcting Insertions and Deletions

This paper studies the cardinality of codes correcting insertions and deletions. We give improved upper and lower bounds on code size. Our upper bound is obtained by utilizing the asymmetric property of list decoding for insertions and deletions and can be seen as analogous to the Elias bound in the Hamming metric. Our non-asymptotic bound is better than the existing bounds when the minimum Levenshtein distance is relatively large. Our asymptotic bound improves on the Elias and MRRW bounds adapted from the Hamming metric in the binary and quaternary cases. Our lower bound improves on the bound by Levenshtein, but its effect is limited and vanishes asymptotically.


Introduction
We study the existence of codes correcting insertions and deletions. We are interested in deriving upper and lower bounds on the cardinality of codes.
Levenshtein [9] gave asymptotic upper and lower bounds for codes correcting a constant number of insertions and deletions. Later, he presented bounds for codes correcting any number of insertions and deletions [10]. Following his work, there have been several studies [3,7,8] on the cardinality of such codes. However, they mainly focused on codes correcting a constant number of insertions/deletions; giving better bounds for codes correcting a constant fraction of insertions/deletions has been elusive. See [2] for a recent survey.
In this work, we present upper and lower bounds on the cardinality of codes correcting insertions and deletions. First, we give a non-asymptotic upper bound on the cardinality of codes correcting insertions/deletions. Asymptotically, it implies that any code C ⊆ Σ^n of rate R that can correct δn insertions/deletions satisfies R ≤ (1 − H_q(δ))/(1 − δ), where |Σ| = q, δ ∈ [0, 1), and H_q(·) is the q-ary entropy function. This improves on the asymptotic upper bounds from the literature: the well-known Elias and MRRW bounds for the Hamming metric can also be employed as upper bounds in the Levenshtein metric (insertions and deletions), and our bound improves on them for q = 2, 4.
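As a quick numerical illustration (our own sketch, not part of the paper; the function names are ours), the asymptotic bound R ≤ (1 − H_q(δ))/(1 − δ) is straightforward to evaluate:

```python
import math


def H(q: int, x: float) -> float:
    """q-ary entropy function H_q(x)."""
    if x == 0.0:
        return 0.0
    return (x * math.log(q - 1, q) - x * math.log(x, q)
            - (1 - x) * math.log(1 - x, q))


def rate_upper_bound(q: int, delta: float) -> float:
    """Asymptotic upper bound R <= (1 - H_q(delta)) / (1 - delta)."""
    return (1.0 - H(q, delta)) / (1.0 - delta)


if __name__ == "__main__":
    for delta in (0.0, 0.1, 0.2, 0.3):
        print(f"q=2, delta={delta:.1f}: R <= {rate_upper_bound(2, delta):.4f}")
```

At δ = 0 the bound is the trivial rate 1, and it decreases as δ grows.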
Our bound is obtained by an argument similar to that of the Elias bound in the Hamming metric. We use the list-size upper bound of [6] for insertions and deletions. It is well-known that any s-deletion-correcting code can correct any s_1 insertions and s_2 deletions with s_1 + s_2 = s. This symmetry in the unique-decoding regime does not hold for list decoding. In [6], it is proved that any code with a large Levenshtein distance enables list decoding in which the decoding radius for insertions is larger than that for deletions. We crucially use this asymmetry to derive our upper bound.
Next, we give a non-asymptotic lower bound on the cardinality of codes. Our bound improves on the bound by Levenshtein [10] by showing that every deletion ball contains multiple words that are close to each other in the Levenshtein distance. Asymptotically, our lower bound is the same as in [10].
Finally, we compare our bounds with the existing bounds in the literature. As a non-asymptotic bound, our upper bound is tighter than the bounds in [8,10] when the minimum distance is relatively large. Our asymptotic upper bound is the tightest among the existing bounds adapted from the Hamming metric for binary and quaternary codes. Regarding lower bounds, although our bound strictly improves on the bound in [10], its effect is limited and vanishes asymptotically.

Related Work
Cullina and Kiyavash [3] improved Levenshtein's upper bound [10] for correcting a constant number of insertions/deletions using a graph-theoretic approach. Kulkarni and Kiyavash [8] derived non-asymptotic upper bounds by a linear programming argument for graph-matching problems. They also gave upper bounds for correcting a constant fraction of insertions/deletions. Although their asymptotic bound (rate function) improved on the bound in [10], it was not given in closed form.
For the extreme regimes where the deletion fraction is either very small or very large, Guruswami and Wang [5] gave efficient constructions of codes correcting deletions. For the case where the coding rate is nearly zero, Kash et al. [7] showed a positive-rate binary code correcting a fraction p of insertions/deletions with p ≥ 0.1737, which improved on the bound of p ≥ 0.1334 in [10]. Bukh, Guruswami, and Håstad [1] significantly improved on the previous results by showing the existence of a positive-rate q-ary code with p ≥ 1 − 2/(q + √q), which gives p ≥ 0.4142 in the binary case. Guruswami, He, and Li [4] showed a slight but highly non-trivial improvement to the upper bound on the fraction of insertions/deletions correctable by codes of non-zero rate: they proved that there is a constant ε > 0 such that any q-ary code correcting a (1 − (1 + ε)/q)-fraction of insertions/deletions must have a rate approaching 0.

Code Size Upper Bound
Let Σ be a finite alphabet. The Levenshtein distance d_L(x, y) between two words x and y is the minimum number of symbol insertions and deletions needed to transform x into y. For a code C ⊆ Σ^n, its minimum Levenshtein distance is the minimum of d_L(c_1, c_2) over all pairs of distinct codewords c_1, c_2 ∈ C. Since any two codewords in C are of the same length, the minimum Levenshtein distance of C is an even number. It is well-known that a code with minimum Levenshtein distance d can correct any s_1 insertions and s_2 deletions as long as s_1 + s_2 ≤ d/2 − 1. The Levenshtein distance between two words in Σ^n takes integer values from 0 to 2n. Thus, we consider the normalized Levenshtein distance δ = d/(2n) in the analysis. The value δ ∈ [0, 1] also represents the fraction of insertions/deletions that can be corrected, since we require (s_1 + s_2)/n ≤ (d/2 − 1)/n = δ − 1/n, which is asymptotically equal to δ.

Let C ⊆ Σ^n be a code of minimum Levenshtein distance d with |Σ| = q. For a word x ∈ Σ^n, let I_t(x) ⊆ Σ^{n+t} be the set of its supersequences of length n + t; namely, I_t(x) consists of the words obtained from x by inserting t symbols. Similarly, let D_t(x) be the set of words obtained from x by deleting t symbols. It is known that the size of I_t(x) does not depend on x and satisfies |I_t(x)| = Σ_{i=0}^{t} C(n + t, i)(q − 1)^i, where C(a, b) denotes the binomial coefficient.

First, we give a simple sphere-packing bound. We use the fact that the number of supersequences |I_t(x)| is independent of x.

Theorem 1. Let C ⊆ Σ^n be a code of minimum Levenshtein distance d and |Σ| = q. For t = d/2 − 1, it holds that |C| ≤ q^{n+t} / Σ_{i=0}^{t} C(n + t, i)(q − 1)^i.

Proof. For each codeword c ∈ C, consider the set of supersequences I_t(c) ⊆ Σ^{n+t}. Since the code has minimum Levenshtein distance d and 2t < d, the sets I_t(c) are pairwise disjoint: a word y common to I_t(c_1) and I_t(c_2) would give d_L(c_1, c_2) ≤ d_L(c_1, y) + d_L(y, c_2) ≤ 2t < d. The statement follows by the equality |I_t(c)| = Σ_{i=0}^{t} C(n + t, i)(q − 1)^i, which holds for every c ∈ C.

Next, we prove our main theorem, which can be seen as an Elias-type upper bound on the code size in the Levenshtein metric.
Theorem 2. Let C ⊆ Σ^n be a code of minimum Levenshtein distance d and |Σ| = q. For any non-negative integer t with t < nd/(2n − d), (2) it holds that |C| ≤ Λ · q^{n+t} / Σ_{i=0}^{t} C(n + t, i)(q − 1)^i, (3) where Λ denotes the upper bound on the list size |D_t(y) ∩ C| given by Lemma 1 below.

Proof. By double counting, since y ∈ I_t(c) if and only if c ∈ D_t(y), it holds that Σ_{c∈C} |I_t(c)| = Σ_{y∈Σ^{n+t}} |D_t(y) ∩ C|. Thus, by choosing y ∈ Σ^{n+t} uniformly at random, E[|D_t(y) ∩ C|] = |C| · |I_t(x)| / q^{n+t}. The averaging argument implies that there exists y ∈ Σ^{n+t} such that |D_t(y) ∩ C| ≥ |C| · |I_t(x)| / q^{n+t}. (5) We have the following lemma.
Lemma 1. For any non-negative integer t with t < nd/(2n − d) and any y ∈ Σ^{n+t}, the list size |D_t(y) ∩ C| is bounded above by the list-size bound of [6, Lemma 1].

Proof. Note that |D_t(y) ∩ C| can be seen as a list size when list decoding of C is applied to y, where y is a received word after t insertions. Thus, the statement follows from [6, Lemma 1] by setting t_I = t and t_D = 0.
By combining (5) and Lemma 1, the statement follows.
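As a concrete sanity check of the ball sizes used in these bounds, Levenshtein's formula |I_t(x)| = Σ_{i=0}^{t} C(n + t, i)(q − 1)^i, and its independence of x, can be verified by brute force for small parameters. The following snippet is our own illustration, not part of the paper:

```python
from itertools import product
from math import comb


def supersequences(x: str, t: int, alphabet: str) -> set[str]:
    """All words of length len(x) + t that contain x as a subsequence."""
    out = set()
    for tup in product(alphabet, repeat=len(x) + t):
        w = "".join(tup)
        it = iter(w)
        if all(c in it for c in x):  # subsequence test; `in` advances the iterator
            out.add(w)
    return out


def i_ball_size(n: int, t: int, q: int) -> int:
    """|I_t(x)| = sum_{i=0}^t C(n+t, i) (q-1)^i, independent of x."""
    return sum(comb(n + t, i) * (q - 1) ** i for i in range(t + 1))
```

For example, every binary word of length 4 has exactly i_ball_size(4, 1, 2) = 6 supersequences of length 5, regardless of the word itself.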
We analyze the asymptotics of Theorems 1 and 2. Let A_q(n, d) denote the maximum cardinality of a code C ⊆ Σ^n of minimum Levenshtein distance d with |Σ| = q, and let R_q(δ) = lim sup_{n→∞} (1/n) log_q A_q(n, 2δn) be the asymptotic coding rate achievable for normalized Levenshtein distance δ. Note that R_q(δ) = 0 for δ ≥ 1 − q^{−1}.
Let Vol_q(n, ℓ) be the volume of the Hamming ball of radius ℓ in F_q^n; namely, Vol_q(n, ℓ) = Σ_{i=0}^{ℓ} C(n, i)(q − 1)^i. It is well-known (cf. [11, Lemma 4.8]) that, for 0 ≤ ℓ ≤ (1 − q^{−1})n, q^{H_q(ℓ/n)n − o(n)} ≤ Vol_q(n, ℓ) ≤ q^{H_q(ℓ/n)n}. Regarding Theorem 1, the denominator of its bound equals Vol_q(n + t, t) with t = d/2 − 1 = δn − 1. By Theorem 1, the rate R of C is therefore bounded above by R ≤ (1 + δ)(1 − H_q(δ/(1 + δ))) + o(1). Thus, we have the following corollary.

Corollary 1. R_q(δ) ≤ (1 + δ)(1 − H_q(δ/(1 + δ))).
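The volume definition and its entropy estimate can be checked numerically; this is our own sketch, with `vol` and `Hq` following the definitions above:

```python
from math import comb, log


def vol(q: int, n: int, r: int) -> int:
    """Vol_q(n, r) = sum_{i=0}^r C(n, i) (q - 1)^i."""
    return sum(comb(n, i) * (q - 1) ** i for i in range(r + 1))


def Hq(q: int, x: float) -> float:
    """q-ary entropy function H_q(x)."""
    if x == 0.0:
        return 0.0
    return x * log(q - 1, q) - x * log(x, q) - (1 - x) * log(1 - x, q)


# Upper estimate Vol_q(n, r) <= q^{H_q(r/n) n}, valid for r/n <= 1 - 1/q:
n, r, q = 20, 6, 2
assert log(vol(q, n, r), q) <= Hq(q, r / n) * n
```

The gap between log_q Vol_q(n, r) and H_q(r/n)·n is the o(n) term, which shrinks relative to n as n grows.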
Regarding Theorem 2, condition (2) can be rewritten as t < δn/(1 − δ). Let γ = δ/(1 − δ) − 1/n and take t = γn, which satisfies condition (2). With this choice, the bound (3) implies that the rate R of C is bounded above by R ≤ (1 − H_q(δ))/(1 − δ) + o(1). We obtain the following corollary.

Corollary 2. R_q(δ) ≤ (1 − H_q(δ))/(1 − δ).

Bounds from Hamming-Metric Bounds
For a code C ⊆ Σ^n, let d_h be the minimum Hamming distance of C. As far as we know, the best-known upper bounds on the coding rate with respect to the normalized minimum Levenshtein distance are obtained from bounds for the Hamming metric: since d_L(x, y) ≤ 2 d_H(x, y) for words of equal length, a code with minimum Levenshtein distance d has d_h ≥ d/2, so any Hamming-metric upper bound at relative distance d/(2n) = δ applies. The following bounds are well-known in the literature.
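For comparison purposes, the textbook Elias bound in the Hamming metric can be evaluated at relative distance δ and set against the bound R ≤ (1 − H_q(δ))/(1 − δ) derived in this paper. The snippet below is our own numerical sketch (function names are ours); the improvement for q = 2 is visible at moderate δ:

```python
from math import log, sqrt


def Hq(q: int, x: float) -> float:
    """q-ary entropy function H_q(x)."""
    if x == 0.0:
        return 0.0
    return x * log(q - 1, q) - x * log(x, q) - (1 - x) * log(1 - x, q)


def elias_hamming(q: int, delta: float) -> float:
    """Textbook Elias bound at relative Hamming distance delta, theta = 1 - 1/q."""
    theta = 1.0 - 1.0 / q
    return 1.0 - Hq(q, theta * (1.0 - sqrt(1.0 - delta / theta)))


def levenshtein_bound(q: int, delta: float) -> float:
    """The bound R <= (1 - H_q(delta)) / (1 - delta) from this paper."""
    return (1.0 - Hq(q, delta)) / (1.0 - delta)
```

For instance, at q = 2 and δ = 0.2 the Elias bound gives roughly 0.49 while the new bound gives roughly 0.35.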

Code Size Lower Bounds
Next, we consider lower bounds on A_q(n, d) for codes C ⊆ Σ^n with |Σ| = q. We assume that d is even. For x ∈ Σ^n and non-negative integers t, s with t ≤ n, let L_{t,s}(x) be the set of words that can be obtained from x by deleting t symbols and inserting s symbols. By definition, it holds that L_{t,0}(x) = D_t(x) and L_{0,s}(x) = I_s(x). We would like to derive an upper bound on the average size of L_{t,t}(x) for x ∈ Σ^n and t ≤ n, because it gives a lower bound on A_q(n, d), as discussed in [10,12]. More specifically, for any X ⊆ Σ^n, let L̄_{t,t}(X) = (1/|X|) Σ_{x∈X} |L_{t,t}(x)| be the average size of L_{t,t}(x) in X. Then, there exists a code C ⊆ Σ^n of minimum Levenshtein distance d satisfying |C| ≥ |X| / L̄_{d/2−1, d/2−1}(X). (7)

For x = (x_1, . . . , x_n) ∈ Σ^n and i ∈ [1, n − 1], we say (i, i + 1) is a distinct adjacent pair in x if x_i ≠ x_{i+1}, and we let p(x) denote the maximum size of a set of pairwise disjoint distinct adjacent pairs in x.

We observe that for any positive integer p ≤ p(x) satisfying p ≤ min{t, n − t}, there are 2^p words in D_t(x) that are within Levenshtein distance 2p of each other. The reason is as follows. Since p ≤ p(x), we can fix a disjoint set of p distinct adjacent pairs in x; there are n − 2p indices in {1, . . . , n} that are not contained in this set. First, we delete t − p symbols from x out of these n − 2p indices; this step requires p ≤ min{t, n − t}. Let y ∈ Σ^{n−t+p} be the resulting word. Second, we delete one of the two symbols from each of the p distinct adjacent pairs in y. There are 2^p possible deletion patterns, and since the two symbols in each pair are distinct, they result in 2^p different words of length n − t. Since each resulting word is at Levenshtein distance p from y, the 2^p words are pairwise within Levenshtein distance 2p. In other words, D_t(x) contains 2^p words Z = {z_1, . . . , z_{2^p}} that are all subsequences of the common word y ∈ Σ^{n−t+p}. Because these words are close to each other, their insertion balls overlap, and for any integer p ≤ p(x) with p ≤ min{t, n − t} this yields an upper bound on |L_{t,t}(x)|. We define p̃(n, q, t, p) ∈ [0, min{p, t, n − t}] ∩ ℕ as the choice of p that gives the best possible (smallest) such upper bound on |L_{t,t}(x)|.
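The 2^p-words observation can be checked concretely (our own illustration, not from the paper): fix disjoint distinct adjacent pairs in a word and delete exactly one symbol from each pair; all 2^p outcomes are distinct.

```python
from itertools import product


def delete_one_per_pair(y: str, pairs: list[tuple[int, int]]) -> set[str]:
    """Delete exactly one of the two symbols from each disjoint index pair of y.

    Indices are 0-based; each pair (i, i + 1) must satisfy y[i] != y[i + 1]."""
    out = set()
    for choice in product((0, 1), repeat=len(pairs)):
        drop = {pair[c] for pair, c in zip(pairs, choice)}
        out.add("".join(ch for j, ch in enumerate(y) if j not in drop))
    return out


# y = "0101" has p = 2 disjoint distinct adjacent pairs: (0, 1) and (2, 3).
words = delete_one_per_pair("0101", [(0, 1), (2, 3)])
```

Here all 2^2 = 4 length-2 words appear, and any two of them are trivially within Levenshtein distance 2p of each other.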
Averaging this bound over x ∈ Σ^n then yields an upper bound on L̄_{t,t}(Σ^n). For integers n and p ∈ [0, n/2], let N_{n,q}(p) be the number of words x ∈ Σ^n such that p(x) = p, where q = |Σ|. Grouping the words x ∈ Σ^n by the value of p(x), the average bound can be expressed in terms of the numbers N_{n,q}(p), with p̃ = p̃(n, q, t, p), (8) where the last equality follows from (4). We determine N_{n,q}(p) for p ∈ [0, n/2].
Proof. We count the number of words x ∈ Σ^n such that p(x) = p. Let (i_1, i_1 + 1), . . . , (i_p, i_p + 1) ∈ {1, . . . , n}^2 be the index pairs of a disjoint set of p distinct adjacent pairs with i_1 < · · · < i_p, and let (a_1, a_1′), . . . , (a_p, a_p′) ∈ Σ^2 with a_j ≠ a_j′ be the corresponding distinct adjacent pairs, whose concrete values are not yet determined. Since the order of the p pairs (a_j, a_j′) is fixed, we can construct all words x with p(x) = p by inserting n − 2p symbols into the word a_1 a_1′ a_2 a_2′ · · · a_p a_p′. There are p + 1 possible places into which symbols can be inserted; namely, the resulting word must be of the form w_1 a_1 a_1′ w_2 a_2 a_2′ w_3 · · · w_p a_p a_p′ w_{p+1}, where w_i ∈ Σ*. Here, we consider the leftmost maximum-sized disjoint index-pair set in x. By the property of distinct adjacent pairs, w_1 must be the empty string or a repetition of a_1. Similarly, for i ∈ {2, . . . , p}, w_i must be the empty string or a repetition of a_i. Note that, unlike the previous cases, w_{p+1} may be the empty string or a repetition of any fixed symbol in Σ.
First, suppose that w_{p+1} is the empty string. Then the number of words x with p(x) = p is determined by the possible lengths of w_1, . . . , w_p and the possible combinations (a_j, a_j′). The lengths form a composition of n − 2p into p non-negative parts, so the number of possibilities is C(n − 2p + p − 1, p − 1) = C(n − p − 1, p − 1) = C(n − p, p) · p/(n − p). There are q(q − 1) combinations for each (a_j, a_j′). Hence, the contribution to N_{n,q}(p) from the case |w_{p+1}| = 0 is q^p (q − 1)^p C(n − p, p) · p/(n − p).

Second, consider the case |w_{p+1}| > 0. The lengths of w_1, . . . , w_{p+1} form a composition of n − 2p into p + 1 non-negative parts whose last part is positive, so the number of possibilities is C(n − 2p − 1 + p, p) = C(n − p − 1, p) = C(n − p, p) · (n − 2p)/(n − p). There are q possible symbols for w_{p+1} and, as in the previous case, (q(q − 1))^p combinations for the (a_j, a_j′). Thus, the contribution from the case |w_{p+1}| > 0 is q^{p+1} (q − 1)^p C(n − p, p) · (n − 2p)/(n − p).

Therefore, we have N_{n,q}(p) = (q(q − 1))^p C(n − p, p) · (p + q(n − 2p))/(n − p).

By using (7) for X = Σ^n, (8), and Theorem 4, we have the following theorem.
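The two case counts above can be verified by exhaustive enumeration over small n. This is our own code, not part of the paper: `p_of` computes p(x) by a simple dynamic program over adjacent pairs, and `N_formula` adds the two contributions in the equivalent form (q(q − 1))^p (C(n − p − 1, p − 1) + q·C(n − p − 1, p)).

```python
from collections import Counter
from itertools import product
from math import comb


def p_of(x: str) -> int:
    """p(x): maximum number of disjoint index pairs (i, i+1) with x[i] != x[i+1]."""
    n = len(x)
    f = [0] * (n + 1)  # f[i] = optimum over the first i symbols
    for i in range(2, n + 1):
        f[i] = max(f[i - 1], f[i - 2] + (x[i - 2] != x[i - 1]))
    return f[n]


def N_formula(n: int, q: int, p: int) -> int:
    """N_{n,q}(p) as the sum of the two case counts from the proof above."""
    empty_tail = comb(n - p - 1, p - 1) if p >= 1 else 0  # case |w_{p+1}| = 0
    nonempty_tail = q * comb(n - p - 1, p)                # case |w_{p+1}| > 0
    return (q * (q - 1)) ** p * (empty_tail + nonempty_tail)


# Exhaustive check over all binary words of length 6.
counts = Counter(p_of("".join(w)) for w in product("01", repeat=6))
```

For n = 6 and q = 2 the enumeration matches the formula for every p, and the values N_{6,2}(0), . . . , N_{6,2}(3) sum to 2^6.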
The expression in Theorem 5 requires optimizing p̃. Here, we give a closed-form expression by choosing p̃ = 1 for p ≥ 1.

Conclusions
This paper has presented improved upper and lower bounds on the size of codes correcting insertions and deletions. In particular, our upper bound improves on the existing bounds in both the non-asymptotic and asymptotic senses. An interesting direction for future work is to develop an upper bound superior to the MRRW bound for large alphabet size q and relative distance δ, since our bound is inferior in some ranges for q ≥ 5. A key ingredient may be a list-size upper bound for insertions/deletions (as in Lemma 1) that depends on q explicitly.