The VLDB Journal, Volume 26, Issue 2, pp 249–274

A unified framework for string similarity search with edit-distance constraint

  • Minghe Yu
  • Jin Wang
  • Guoliang Li
  • Yong Zhang
  • Dong Deng
  • Jianhua Feng
Regular Paper

DOI: 10.1007/s00778-016-0449-y

Cite this article as:
Yu, M., Wang, J., Li, G. et al. The VLDB Journal (2017) 26: 249. doi:10.1007/s00778-016-0449-y

Abstract

String similarity search is a fundamental operation in data cleaning and integration. It has two variants: threshold-based string similarity search and top-\(k\) string similarity search. Existing algorithms are efficient for one variant or the other, but most cannot support both. To address this limitation, we propose a unified framework. We first recursively partition strings into disjoint segments and build a hierarchical segment tree index (\(\textsf{HS-Tree}\)) on top of the segments. We then use the \(\textsf{HS-Tree}\) to support similarity search. For threshold-based search, we identify the appropriate tree nodes based on the threshold to answer the query and devise an efficient algorithm (HS-Search). For top-\(k\) search, we identify promising strings that are likely to be similar to the query, use these strings to estimate an upper bound for pruning dissimilar strings, and propose an algorithm (HS-Topk). We develop effective pruning techniques to further improve performance. To support large data sets, we extend our techniques to the disk-based setting. Experimental results on real-world data sets show that our method achieves high performance on both problems and outperforms state-of-the-art algorithms by 5–10 times.
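To make the two query variants concrete, the following is a minimal sketch of a brute-force baseline in Python. It is not the HS-Tree index, HS-Search, or HS-Topk; it only scans a list of strings. The function names `edit_distance`, `threshold_search`, and `topk_search` are hypothetical and chosen for illustration. The top-\(k\) routine uses the worst distance among the current \(k\) candidates as an upper bound for skipping dissimilar strings, which loosely mirrors the pruning idea described above.

```python
import heapq
from typing import List, Optional, Tuple


def edit_distance(a: str, b: str, bound: Optional[int] = None) -> int:
    """Levenshtein distance between a and b.

    If bound is given and the distance must exceed it, a value greater
    than bound is returned early (not necessarily the exact distance).
    """
    if bound is not None and abs(len(a) - len(b)) > bound:
        return bound + 1  # length filter: at least |len(a) - len(b)| edits
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        # The final distance is never smaller than the minimum of any row.
        if bound is not None and min(cur) > bound:
            return bound + 1
        prev = cur
    return prev[-1]


def threshold_search(strings: List[str], query: str, tau: int) -> List[str]:
    """Return every string whose edit distance to query is at most tau."""
    return [s for s in strings if edit_distance(s, query, tau) <= tau]


def topk_search(strings: List[str], query: str, k: int) -> List[Tuple[int, str]]:
    """Return the k strings closest to query as sorted (distance, string) pairs."""
    heap: List[Tuple[int, str]] = []  # max-heap via negated distances
    for s in strings:
        if len(heap) < k:
            heapq.heappush(heap, (-edit_distance(s, query), s))
            continue
        upper = -heap[0][0]  # worst distance among the current k candidates
        d = edit_distance(s, query, upper - 1)  # only strictly better helps
        if d < upper:
            heapq.heapreplace(heap, (-d, s))
    return sorted((-nd, s) for nd, s in heap)


if __name__ == "__main__":
    data = ["kitten", "sitting", "mitten", "bitten", "kitchen", "written"]
    print(threshold_search(data, "kitten", 1))
    print(topk_search(data, "kitten", 3))
```

A linear scan like this costs one distance computation per string in the worst case; an index over string segments, as the paper proposes, aims to avoid examining most strings at all.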

Keywords

  • Similarity search
  • Edit distance
  • Top-k
  • Disk-based method
  • Partition

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  • Minghe Yu
    • 1
  • Jin Wang
    • 1
  • Guoliang Li
    • 1
  • Yong Zhang
    • 1
  • Dong Deng
    • 1
  • Jianhua Feng
    • 1
  1. Department of Computer Science and Technology, Tsinghua University, Beijing, China