Locating similar names through locality sensitive hashing and graph theory
Locality Sensitive Hashing is a well-known technique for finding similar texts, and it has been applied to plagiarism detection, mirror page identification and identifying the original source of a news article. In this paper we show how Locality Sensitive Hashing can be applied to identify misspelled personal names (first name, middle name and last name) or near duplicates. In our case, because the texts are so short, measuring the similarity of the names with two similarity functions (the Jaccard Similarity and the Full Damerau-Levenshtein Distance) gave better results than using a single one. All the experimental work was carried out using the statistical software R and the libraries textreuse and stringdist.
Keywords: Information retrieval, Locality sensitive hashing, Entity deduplication, Textual similarity
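The approach summarized in the abstract can be illustrated with a minimal, self-contained sketch. It is written in Python for portability and is not the authors' R code, which relied on textreuse and stringdist. Several details are assumptions made only for illustration: character bigrams as shingles, an MD5-based MinHash, 20 bands of 3 rows, and the rule that a pair is accepted only when it passes both a Jaccard threshold and a Damerau-Levenshtein threshold. All function names and threshold values are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(name, k=2):
    """Character k-grams of a lower-cased name (assumed tokenization)."""
    s = name.lower()
    if len(s) < k:
        return {s}
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def damerau_levenshtein(a, b):
    """Full (unrestricted) Damerau-Levenshtein distance."""
    last_row = {}  # last row in which each character of `a` was seen
    big = len(a) + len(b)
    d = [[0] * (len(b) + 2) for _ in range(len(a) + 2)]
    d[0][0] = big
    for i in range(len(a) + 1):
        d[i + 1][0], d[i + 1][1] = big, i
    for j in range(len(b) + 1):
        d[0][j + 1], d[1][j + 1] = big, j
    for i in range(1, len(a) + 1):
        last_col = 0  # last column in this row where the characters matched
        for j in range(1, len(b) + 1):
            k, l = last_row.get(b[j - 1], 0), last_col
            cost = 0 if a[i - 1] == b[j - 1] else 1
            if cost == 0:
                last_col = j
            d[i + 1][j + 1] = min(
                d[i][j] + cost,                            # substitution or match
                d[i + 1][j] + 1,                           # insertion
                d[i][j + 1] + 1,                           # deletion
                d[k][l] + (i - k - 1) + 1 + (j - l - 1),   # transposition
            )
        last_row[a[i - 1]] = i
    return d[len(a) + 1][len(b) + 1]


def minhash(shingle_set, num_hashes):
    """MinHash signature: minimum of a seeded hash over the shingles."""
    return tuple(
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set)
        for seed in range(num_hashes)
    )


def lsh_candidates(names, bands=20, rows=3, k=2):
    """Index pairs whose signatures agree on at least one band."""
    buckets = defaultdict(set)
    for idx, name in enumerate(names):
        sig = minhash(shingles(name, k), bands * rows)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def similar_names(names, min_jaccard=0.5, max_dl=2, k=2):
    """Keep LSH candidates that pass both similarity checks (assumed rule)."""
    result = []
    for i, j in sorted(lsh_candidates(names, k=k)):
        jac = jaccard(shingles(names[i], k), shingles(names[j], k))
        dl = damerau_levenshtein(names[i].lower(), names[j].lower())
        if jac >= min_jaccard and dl <= max_dl:
            result.append((names[i], names[j], round(jac, 2), dl))
    return result


if __name__ == "__main__":
    sample = ["Maria Garcia Lopez", "Maria Garcia Lopez", "Mraia Garcia Lopez", "Zhou Wei"]
    for match in similar_names(sample):
        print(match)
```

LSH only proposes candidate pairs, so two names that differ slightly may or may not share a bucket depending on the number of bands and rows; the two similarity functions are evaluated only on the candidates, which is what avoids comparing every pair of names.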
This research work was supported by Sungshin Women’s University. In addition, L.J.G.V. and A.L.S.O. thank the RAMSES project. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 700326. Website: http://ramses2020.eu.