Efficient Similarity Search in Very Large String Sets

  • Dandy Fenz
  • Dustin Lange
  • Astrid Rheinländer
  • Felix Naumann
  • Ulf Leser
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7338)

Abstract

String similarity search is required by many real-life applications, such as spell checking, data cleansing, fuzzy keyword search, or comparison of DNA sequences. Given a very large string set and a query string, the string similarity search problem is to efficiently find all strings in the string set that are similar to the query string. Similarity is defined using a similarity (or distance) measure, such as edit distance or Hamming distance. In this paper, we introduce the State Set Index (SSI) as an efficient solution for this search problem.
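
As a point of reference for the problem statement above, the following is a minimal sketch of the naive baseline: compute the edit distance between the query and every string in the set and keep those within a threshold. The function names `edit_distance` and `similarity_search`, the threshold parameter `k`, and the sample names are illustrative assumptions and do not come from the paper; an index such as SSI aims to avoid exactly this full scan.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # drop ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def similarity_search(strings, query, k):
    """Return every string within edit distance k of the query (linear scan)."""
    return [s for s in strings if edit_distance(s, query) <= k]
```

The scan touches every string once per query, which is why it does not scale to sets with many millions of entries.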

SSI is based on a trie (prefix index) that is interpreted as a nondeterministic finite automaton. SSI implements a novel state labeling strategy making the index highly space-efficient. Furthermore, SSI’s space consumption can be gracefully traded against search time.
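
To illustrate what reading a trie as a nondeterministic finite automaton can look like, here is a hedged sketch of a generic edit-distance search that keeps a set of active trie nodes, each with the number of errors spent so far. It is not SSI's state labeling strategy, which the abstract does not detail; `TrieNode`, `build_trie`, `search`, and the threshold `k` are hypothetical names chosen for this example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.word = None     # set to the full string at terminal nodes


def build_trie(strings):
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.word = s
    return root


def _closure(states, k):
    """Follow trie edges without consuming a query character (cost 1 each)."""
    stack = list(states.items())
    while stack:
        node, err = stack.pop()
        if err > states.get(node, k + 1) or err + 1 > k:
            continue  # stale entry, or error budget exhausted
        for child in node.children.values():
            if err + 1 < states.get(child, k + 1):
                states[child] = err + 1
                stack.append((child, err + 1))
    return states


def search(root, query, k):
    """Return all indexed strings within edit distance k of the query."""
    states = _closure({root: 0}, k)   # active node -> fewest errors so far
    for ch in query:
        nxt = {}

        def relax(node, err):
            if err <= k and err < nxt.get(node, k + 1):
                nxt[node] = err

        for node, err in states.items():
            relax(node, err + 1)      # query character has no counterpart
            for label, child in node.children.items():
                relax(child, err if label == ch else err + 1)  # match / substitute
        states = _closure(nxt, k)
    return sorted(n.word for n in states if n.word is not None)
```

A call such as `search(build_trie(names), "meier", 1)` then explores only those trie branches whose prefixes can still lead to a match within the error budget, instead of comparing the query against every string.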

We evaluated SSI on different sets of person names with up to 170 million strings from a social network and compared it to other state-of-the-art methods. We show that in the majority of cases, SSI is significantly faster than other tools and requires less index space.

Keywords

  • Edit Distance
  • Index Size
  • Query Response Time
  • Query String
  • Approximate String Match
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Dandy Fenz (1)
  • Dustin Lange (1)
  • Astrid Rheinländer (2)
  • Felix Naumann (1)
  • Ulf Leser (2)

  1. Hasso Plattner Institute, Potsdam, Germany
  2. Department of Computer Science, Humboldt-Universität zu Berlin, Berlin, Germany
