MIST: Top-k Approximate Sub-string Mining Using Triplet Statistical Significance

  • Sourav Dutta
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9022)

Abstract

Efficiently extracting strings or sub-strings similar to an input query string is essential in applications such as instant search and record linkage, where the similarity between two strings is usually quantified by edit distance. This paper proposes MIST, a novel top-k approximate sub-string matching algorithm that, for a given query, relies on the Chi-squared statistical significance of string triplets and thereby avoids expensive edit distance computation. Experiments with real-life data validate the run-time effectiveness and accuracy of our algorithm.
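
For context on the cost the abstract refers to, the following is a minimal sketch of the standard dynamic-programming computation of edit (Levenshtein) distance. It is not the MIST algorithm and is not taken from the paper; the function name `edit_distance` is an illustrative assumption. The point is only that each comparison takes time proportional to the product of the two string lengths, which is the expense a significance-based method aims to avoid.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (illustrative baseline, not MIST)."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

Ranking every candidate sub-string of a large text against a query this way repeats the quadratic computation once per candidate, which is why filtering or scoring schemes that avoid it are attractive.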

Keywords

Approx. string search, Edit distance, χ² statistical significance, n-grams

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Sourav Dutta
    Max-Planck Institute for Informatics, Germany
