Bounds and Estimates on the Average Edit Distance

  • Conference paper
String Processing and Information Retrieval (SPIRE 2019)

Abstract

The edit distance is a metric of dissimilarity between strings, widely applied in computational biology, speech recognition, and machine learning. Let \(e_k(n)\) denote the average edit distance between random, independent strings of n characters from an alphabet of a given size k. An open problem is the exact value of \(\alpha _{k}(n)= e_k(n)/n\). While it is known that, for increasing n, \(\alpha _{k}(n)\) approaches a limit \(\alpha _{k}\), the exact value of this limit is unknown, for any \(k\ge 2\). This paper presents an upper bound to \(\alpha _{k}\) based on the exact computation of some \(\alpha _k(n)\) and a lower bound to \(\alpha _{k}\) based on combinatorial arguments on edit scripts. Statistical estimates of \(\alpha _{k}(n)\) are also obtained, with analysis of error and of confidence intervals. The techniques are applied to several alphabet sizes k. In particular, for a binary alphabet, the rigorous bounds are \(0.1742 \le \alpha _2 \le 0.3693\) while the obtained estimate is \(\alpha _2 \approx 0.2888\); for a quaternary alphabet, \(0.3598 \le \alpha _4 \le 0.6318\) and \(\alpha _4 \approx 0.5180\). These values are more accurate than those previously published.
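The quantities defined in the abstract can be illustrated with a minimal Python sketch. It assumes the unit-cost Levenshtein distance (insertion, deletion, substitution) and uniformly random characters; the function names `edit_distance`, `exact_alpha`, and `estimate_alpha` are hypothetical and are not taken from the paper. Exhaustive enumeration is feasible only for very small n, and the sampling routine is a plain Monte Carlo average with no error or confidence-interval analysis, so this is not the authors' procedure. Because the edit distance of two concatenations is at most the sum of the distances of their parts, \(e_k(n)\) is subadditive, so any exactly computed \(\alpha_k(n)\) is an upper bound on \(\alpha_k\).

```python
import itertools
import random


def edit_distance(x, y):
    """Unit-cost Levenshtein distance via the standard row-by-row DP."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, xi in enumerate(x, start=1):
        curr = [i]  # distance from x[:i] to the empty prefix of y
        for j, yj in enumerate(y, start=1):
            curr.append(min(
                prev[j - 1] + (xi != yj),  # match or substitution
                prev[j] + 1,               # deletion from x
                curr[j - 1] + 1,           # insertion into x
            ))
        prev = curr
    return prev[-1]


def exact_alpha(k, n):
    """Exact alpha_k(n) by enumerating all k**(2n) ordered pairs of strings."""
    strings = list(itertools.product(range(k), repeat=n))
    total = sum(edit_distance(x, y) for x in strings for y in strings)
    return total / (len(strings) ** 2 * n)


def estimate_alpha(k, n, samples, seed=0):
    """Monte Carlo estimate of alpha_k(n) from independent uniform pairs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        x = [rng.randrange(k) for _ in range(n)]
        y = [rng.randrange(k) for _ in range(n)]
        total += edit_distance(x, y)
    return total / (samples * n)


if __name__ == "__main__":
    print(exact_alpha(2, 4))             # exact value for tiny n
    print(estimate_alpha(2, 200, 100))   # sampled estimate for larger n
```

The enumeration cost grows as \(k^{2n}\) times the quadratic cost of each distance computation, which is why exact values are only within reach for small n and sampling is used beyond that.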

This work was partially supported by University of Padova projects CPDA152255/15 and CPGA3/13; by MIUR, the Italian Ministry of Education, University and Research, under Grant 20174LF3T8 AHeAD: efficient Algorithms for HArnessing networked Data; and by an IBM SUR Grant.

Notes

  1. A similar algorithm computes the length of the LCS. The recurrence (3) becomes \(M_{i,0} = 0\), \(M_{0,j} = 0\), and \(M_{i,j} = \max \{ M_{i-1,j-1} + (1-\xi _{i,j}) ; M_{i-1,j} ; M_{i,j-1} \}\).
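The recurrence in this note translates directly into a dynamic-programming table. The following Python sketch assumes that \(\xi_{i,j}\) is the mismatch indicator (1 when the i-th character of the first string differs from the j-th character of the second, 0 otherwise); that reading is inferred from the form of the recurrence, since the definition lies outside this excerpt. The function name `lcs_length` is hypothetical.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y.

    Implements M[i][0] = 0, M[0][j] = 0 and
    M[i][j] = max(M[i-1][j-1] + (1 - xi), M[i-1][j], M[i][j-1]),
    where xi is assumed to be 1 on a mismatch and 0 on a match.
    """
    rows, cols = len(x) + 1, len(y) + 1
    m = [[0] * cols for _ in range(rows)]  # boundary row and column stay 0
    for i in range(1, rows):
        for j in range(1, cols):
            xi = 1 if x[i - 1] != y[j - 1] else 0  # assumed mismatch indicator
            m[i][j] = max(
                m[i - 1][j - 1] + (1 - xi),  # extend a common subsequence on a match
                m[i - 1][j],                 # skip a character of x
                m[i][j - 1],                 # skip a character of y
            )
    return m[rows - 1][cols - 1]


if __name__ == "__main__":
    print(lcs_length("ABCBDAB", "BDCABA"))
```

The full table is kept here for clarity; only the previous row is needed to compute the length, so memory can be reduced to linear in the length of the second string.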

Author information

Corresponding author

Correspondence to Michele Schimd.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Schimd, M., Bilardi, G. (2019). Bounds and Estimates on the Average Edit Distance. In: Brisaboa, N., Puglisi, S. (eds) String Processing and Information Retrieval. SPIRE 2019. Lecture Notes in Computer Science, vol 11811. Springer, Cham. https://doi.org/10.1007/978-3-030-32686-9_7

  • DOI: https://doi.org/10.1007/978-3-030-32686-9_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-32685-2

  • Online ISBN: 978-3-030-32686-9

  • eBook Packages: Computer Science, Computer Science (R0)