
Multimedia Tools and Applications, Volume 78, Issue 21, pp 29853–29866

Locating similar names through locality sensitive hashing and graph theory

  • Fernando Turrado García
  • Luis Javier García Villalba (Email author)
  • Ana Lucila Sandoval Orozco
  • Francisco Damián Aranda Ruiz
  • Andrés Aguirre Juárez
  • Tai-Hoon Kim

Abstract

Locality Sensitive Hashing is a well-known technique for finding similar texts; it has been applied to plagiarism detection, mirror page identification, and identifying the original source of a news article. In this paper we show how Locality Sensitive Hashing can be applied to identify misspelled personal names (first name, middle name and last name) or near duplicates. Because the texts are short, using two similarity functions (the Jaccard Similarity and the Full Damerau-Levenshtein Distance) to measure the similarity of the names gave better results than using a single one. All the experimental work was carried out using the statistical software R and the libraries textreuse and stringdist.
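The idea described in the abstract can be illustrated with a minimal sketch. It is written in Python rather than with the R libraries the authors used, and it is not the authors' implementation: the function names, the 2-character shingles, the MinHash banding parameters and both thresholds are assumptions chosen only for illustration. Candidate pairs come from MinHash-based LSH buckets, and each candidate is then checked with both the Jaccard similarity and the full (unrestricted) Damerau-Levenshtein distance.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def normalize(name):
    """Lower-case a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def shingles(name, k=2):
    """Set of character k-grams of the normalized name."""
    text = normalize(name)
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def damerau_levenshtein(s, t):
    """Full (unrestricted) Damerau-Levenshtein distance."""
    max_dist = len(s) + len(t)
    # d[i + 1][j + 1] holds the distance between s[:i] and t[:j];
    # row 0 and column 0 are sentinel borders filled with max_dist.
    d = [[max_dist] * (len(t) + 2) for _ in range(len(s) + 2)]
    for i in range(len(s) + 1):
        d[i + 1][1] = i
    for j in range(len(t) + 1):
        d[1][j + 1] = j
    last_row = {}  # last row of s in which each character was seen
    for i in range(1, len(s) + 1):
        last_col = 0  # last column of t that matched s[i - 1]
        for j in range(1, len(t) + 1):
            i1 = last_row.get(t[j - 1], 0)
            j1 = last_col
            cost = 1
            if s[i - 1] == t[j - 1]:
                cost = 0
                last_col = j
            d[i + 1][j + 1] = min(
                d[i][j] + cost,                               # substitution or match
                d[i + 1][j] + 1,                              # insertion
                d[i][j + 1] + 1,                              # deletion
                d[i1][j1] + (i - i1 - 1) + 1 + (j - j1 - 1),  # transposition
            )
        last_row[s[i - 1]] = i
    return d[len(s) + 1][len(t) + 1]


def minhash(tokens, num_hashes):
    """MinHash signature using seeded BLAKE2b hashes (deterministic)."""
    def h(seed, token):
        raw = hashlib.blake2b(f"{seed}:{token}".encode("utf-8"), digest_size=8)
        return int.from_bytes(raw.digest(), "big")
    return [min(h(seed, tok) for tok in tokens) for seed in range(num_hashes)]


def lsh_candidates(names, bands=16, rows=2):
    """Pairs of names sharing at least one LSH band bucket."""
    buckets = defaultdict(set)
    for name in names:
        sig = minhash(shingles(name), bands * rows)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(name)
    pairs = set()
    for group in buckets.values():
        pairs.update(combinations(sorted(group), 2))
    return pairs


def similar_pairs(names, min_jaccard=0.3, max_dl=2):
    """Keep candidate pairs that pass BOTH (hypothetical) thresholds."""
    result = []
    for a, b in sorted(lsh_candidates(names)):
        sim = jaccard(shingles(a), shingles(b))
        dist = damerau_levenshtein(normalize(a), normalize(b))
        if sim >= min_jaccard and dist <= max_dl:
            result.append((a, b, round(sim, 2), dist))
    return result
```

Calling `similar_pairs(["Luis Garcia Villalba", "Luis Gracia Villalba", "Tai-Hoon Kim"])` would report the first two names only if they land in a common bucket; as with any LSH scheme, the banding parameters trade missed pairs against extra comparisons. Requiring both measures is one plausible way to combine them; the abstract does not state the exact combination rule.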

Keywords

Information retrieval, Locality sensitive hashing, Entity deduplication, Textual similarity

Notes

Acknowledgements

This research work was supported by Sungshin Women’s University. In addition, L.J.G.V. and A.L.S.O. thank the RAMSES project. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 700326. Website: http://ramses2020.eu.


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Fernando Turrado García (1)
  • Luis Javier García Villalba (1, Email author)
  • Ana Lucila Sandoval Orozco (1)
  • Francisco Damián Aranda Ruiz (1)
  • Andrés Aguirre Juárez (1)
  • Tai-Hoon Kim (2)

  1. Group of Analysis, Security and Systems (GASS), Department of Software Engineering and Artificial Intelligence (DISIA), Faculty of Information Technology and Computer Science, Office 431, Universidad Complutense de Madrid (UCM), Madrid, Spain
  2. Department of Convergence Security, Sungshin Women’s University, Seoul, Korea
