Locating similar names through locality sensitive hashing and graph theory

Abstract

Locality Sensitive Hashing is a well-known technique for finding similar texts; it has been applied to plagiarism detection, mirror page identification and identifying the original source of a news article. In this paper we show how Locality Sensitive Hashing can be applied to identify misspelled personal names (first name, middle name and last name) or near duplicates. In our case, because the texts are short, using two similarity functions (the Jaccard Similarity and the Full Damerau-Levenshtein Distance) to measure the similarity of the names gave better results than using a single one. All the experimental work was carried out in the statistical software R with the libraries textreuse and stringdist.
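The two similarity functions named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' R code: the character-bigram tokenisation, the thresholds and the AND combination in the hypothetical helper `likely_same_name` are all assumptions made for illustration, since the abstract does not state how the two measures are combined.

```python
def shingles(text, k=2):
    """Set of character k-grams of a lower-cased string (k=2 is an assumption)."""
    s = text.lower()
    return {s[i:i + k] for i in range(max(len(s) - k + 1, 1))}


def jaccard_similarity(a, b, k=2):
    """Jaccard similarity |A & B| / |A | B| over character k-gram sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


def damerau_levenshtein(a, b):
    """Full (unrestricted) Damerau-Levenshtein distance.

    Counts insertions, deletions, substitutions and transpositions of
    adjacent characters, each with cost 1.
    """
    inf = len(a) + len(b)
    last_row = {}  # last row of `a` in which each character was seen
    # The table carries one extra border row and column filled with `inf`.
    d = [[inf] * (len(b) + 2) for _ in range(len(a) + 2)]
    for i in range(len(a) + 1):
        d[i + 1][1] = i
    for j in range(len(b) + 1):
        d[1][j + 1] = j
    for i in range(1, len(a) + 1):
        last_col = 0  # last column in this row where the characters matched
        for j in range(1, len(b) + 1):
            i1 = last_row.get(b[j - 1], 0)
            j1 = last_col
            cost = 1
            if a[i - 1] == b[j - 1]:
                cost = 0
                last_col = j
            d[i + 1][j + 1] = min(
                d[i][j] + cost,                                # substitution or match
                d[i + 1][j] + 1,                               # insertion
                d[i][j + 1] + 1,                               # deletion
                d[i1][j1] + (i - i1 - 1) + 1 + (j - j1 - 1),   # transposition
            )
        last_row[a[i - 1]] = i
    return d[len(a) + 1][len(b) + 1]


def likely_same_name(a, b, min_jaccard=0.5, max_edits=2):
    """Hypothetical rule: both measures must agree (thresholds are assumptions)."""
    return (jaccard_similarity(a, b) >= min_jaccard
            and damerau_levenshtein(a.lower(), b.lower()) <= max_edits)
```

For example, `likely_same_name("Luis Javier Garcia", "Luis Javier Gracia")` holds under these thresholds, because the two strings differ by a single adjacent transposition and share most of their bigrams.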

References

  1. Chollampatt S, Hwee Tou N (2018) A multilayer convolutional encoder-decoder neural network for grammatical error correction. In: Proceedings of the Association for the Advancement of Artificial Intelligence. New Orleans, Louisiana, USA
  2. Csardi G, Nepusz T (2006) The igraph software package for complex network research. Int J Complex Syst 1695:1–9
  3. Damerau FJ (1964) A technique for computer detection and correction of spelling errors. Commun ACM 7(3):171–176
  4. Datar M, Immorlica N, Indyk P, Mirrokni VS (2004) Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the Twentieth Annual Symposium on Computational Geometry, pp 253–262, New York, NY, USA
  5. Fivez P, Suster S, Daelemans W (2017) Unsupervised context-sensitive spelling correction of English and Dutch clinical free-text with word and character n-gram embeddings. Comput Linguistics Netherlands 7:39–52
  6. Hopcroft J, Tarjan R (1973) Algorithm 447: efficient algorithms for graph manipulation. Commun ACM 16(6):372–378
  7. Karp R (1972) Reducibility among combinatorial problems. Complex Comput Comput 40:85–103
  8. Koga H, Ishibashi T, Watanabe T (2007) Fast agglomerative hierarchical clustering algorithm using locality-sensitive hashing. Knowl Inf Syst 12(1):25–53
  9. Lai KH, Topaz M, Goss F, Zhou L (2015) Automated misspelling detection and correction in clinical free-text records. J Biomed Inform 55:188–195
  10. Levenshtein V (1966) Binary codes capable of correcting deletions, insertions and reversals. Soviet Phys Doklady 10(8)
  11. Malhotra P, Agarwal P, Shroff G (2014) Graph-parallel entity resolution using LSH and IMM. In: Proceedings of the Workshops of the EDBT/ICDT, vol 1133, pp 41–49
  12. Morris MR, Fourney A, Ali A, Vonessen L (2018) Understanding the needs of searchers with dyslexia. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, vol 32. Communications of the ACM, New York, pp 1–35
  13. Mullen L (2016) textreuse: Detect Text Reuse and Document Similarity. R package version 0.1.4
  14. Paulevé L, Jégou H, Amsaleg L (2010) Locality sensitive hashing: a comparison of hash function types and querying mechanisms. Pattern Recogn Lett 31(11):1348–1358
  15. Pérez A, Atutxa A, Casillas A, Gojenola K, Sellart A (2018) Inferred joint multigram models for medical term normalization according to ICD. Int J Med Inform 110:111–117
  16. R Core Team (2015) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria
  17. Rajaraman A, Ullman JD (2011) Mining of massive datasets. Cambridge University Press, New York
  18. Satuluri V, Parthasarathy S (2012) Bayesian locality sensitive hashing for fast similarity search. In: Proceedings of the 38th International Conference on Very Large Data Bases, vol 5, pp 430–441, Istanbul, Turkey
  19. Shapiro D, Japkowicz N, Lemay M, Bolic M (2018) Fuzzy string matching with a deep neural network. Appl Artif Intell 32(1):1–12
  20. Subhashree VK, Tharini C (2017) An energy efficient routing and fault tolerant data aggregation (EERFTDA) algorithm for wireless sensor networks. J High Speed Netw 23:15–32
  21. Tong Q, Li X, Yuan B (2017) A highly scalable clustering scheme using boundary information. Pattern Recogn Lett 89(1):1–7
  22. van der Loo M (2014) The stringdist package for approximate string matching. The R J 6:111–122
  23. Wickham H, Francois R (2016) dplyr: A Grammar of Data Manipulation. R package version 0.5.0
  24. Yuhua J, Liang B, Peng W, Jinlin G, Yuxiang X, Tianyuan Y (2017) Utilizing locality-sensitive hash learning for cross-media retrieval. In: International conference on multimedia modeling, pp 550–561, Reykjavik, Iceland

Acknowledgements

This research work was supported by Sungshin Women’s University. In addition, L.J.G.V. and A.L.S.O. thank the RAMSES project, which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 700326. Website: http://ramses2020.eu.

Author information

Corresponding author

Correspondence to Luis Javier García Villalba.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: R Source Code

This appendix contains the R snippet used to compute the buckets of candidates (using Locality Sensitive Hashing), the Jaccard distance and the Damerau-Levenshtein distance.

The R snippet requires the following libraries: base R ([16]), dplyr ([23]), igraph ([2]), stringdist ([22]) and textreuse ([13]).

[Figure: R source code listing]
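As a rough illustration of the bucketing step described in this appendix, the following Python sketch builds MinHash signatures, splits them into bands to obtain candidate pairs, and groups linked names with connected components. It is not the R listing: the function names, the bigram shingles, the MD5-based hash family and the parameters `num_hashes` and `bands` are all hypothetical choices, and grouping by connected components is an assumption about how the graph step works, suggested only by the igraph dependency and reference [6].

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=2):
    """Set of character k-grams of a lower-cased string."""
    s = text.lower()
    return {s[i:i + k] for i in range(max(len(s) - k + 1, 1))}


def minhash_signature(tokens, num_hashes=24):
    """For each seeded hash function, keep the minimum hash over all tokens."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{t}".encode("utf-8")).digest()[:8], "big")
            for t in tokens
        ))
    return tuple(signature)


def lsh_candidate_pairs(names, num_hashes=24, bands=12):
    """Index pairs of names whose signatures agree on at least one band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for idx, name in enumerate(names):
        sig = minhash_signature(shingles(name), num_hashes)
        for b in range(bands):
            # Names with an identical band slice land in the same bucket.
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def connected_components(n, edges):
    """Group nodes 0..n-1 linked by `edges`, using a union-find structure."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    groups = defaultdict(list)
    for x in range(n):
        groups[find(x)].append(x)
    return sorted(groups.values())
```

In a full pipeline the candidate pairs would typically be filtered by the Jaccard and Damerau-Levenshtein measures before being passed as edges, so that each resulting component collects the spellings believed to refer to the same person. More bands with fewer rows each make the bucketing more permissive, which suits very short texts such as names.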

About this article

Cite this article

Turrado García, F., García Villalba, L.J., Sandoval Orozco, A.L. et al. Locating similar names through locality sensitive hashing and graph theory. Multimed Tools Appl 78, 29853–29866 (2019). https://doi.org/10.1007/s11042-018-6375-9

Keywords

  • Information retrieval
  • Locality sensitive hashing
  • Entity deduplication
  • Textual similarity