Cybernetics and Systems Analysis, Volume 55, Issue 5, pp 860–878

Index Structures for Fast Similarity Search for Symbol Strings

  • D. A. Rachkovskij
NEW MEANS OF CYBERNETICS, INFORMATICS, COMPUTER ENGINEERING, AND SYSTEMS ANALYSIS

Abstract

This article surveys index structures for fast similarity search over objects represented by symbol strings. It considers index structures for both exact and approximate search by edit distance, with the main focus on structures based on inverted indexing, similarity-preserving hashing, and treelike structures. The ideas behind well-known and recently proposed algorithms are described.
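
As a rough illustration of the inverted-indexing idea named in the abstract, the following minimal Python sketch builds an n-gram inverted index and answers exact threshold queries by edit distance. It is not taken from the article: the class `NGramIndex`, the helpers `ngrams` and `edit_distance`, and the default `n=2` are hypothetical names and choices made for this example. It relies on the standard count-filtering bound (strings within edit distance k share at least max(|s|, |q|) − n + 1 − k·n n-grams) and, for simplicity, applies the filter to every stored string rather than merging posting lists.

```python
from collections import Counter, defaultdict


def ngrams(s, n):
    """Return the multiset of n-grams of s as a Counter."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


class NGramIndex:
    """Hypothetical n-gram inverted index (illustrative only)."""

    def __init__(self, strings, n=2):
        self.n = n
        self.strings = list(strings)
        # gram -> {string id: number of occurrences of the gram in that string}
        self.postings = defaultdict(dict)
        for sid, s in enumerate(self.strings):
            for gram, count in ngrams(s, n).items():
                self.postings[gram][sid] = count

    def search(self, query, k):
        """Return all stored strings within edit distance k of query."""
        n = self.n
        # Count n-grams shared with the query (multiset intersection size).
        shared = Counter()
        for gram, cq in ngrams(query, n).items():
            for sid, cs in self.postings.get(gram, {}).items():
                shared[sid] += min(cq, cs)
        result = []
        for sid, s in enumerate(self.strings):
            # Length filter: k edits change the length by at most k.
            if abs(len(s) - len(query)) > k:
                continue
            # Count filter: each edit destroys at most n n-grams.
            need = max(len(s), len(query)) - n + 1 - k * n
            if need > 0 and shared[sid] < need:
                continue
            # Verification step: compute the true edit distance.
            if edit_distance(query, s) <= k:
                result.append(s)
        return result
```

For example, `NGramIndex(["kitten", "sitting", "mitten"]).search("kitten", 1)` verifies only the candidates that pass both filters. When the bound `need` is not positive, the count filter cannot prune, so the sketch falls back to verification for that string.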

Keywords

similarity search; edit distance; nearest neighbor; near neighbor; index structure; inverted indexing; n-gram; locality-sensitive hashing; treelike structure

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. International Research and Training Center for Information Technologies and Systems, NAS of Ukraine and MES of Ukraine, Kyiv, Ukraine
