Cluster Computing, Volume 22, Supplement 1, pp 1959–1971

Bucket-size balancing locality sensitive hashing using the map reduce paradigm

  • Kyung Mi Lee
  • Yoon-Su Jeong
  • Sang Ho Lee
  • Keon Myung Lee


Similarity search is an essential operation in such domains as data mining and content-based information retrieval. This simple operation becomes a considerable burden when the number of data records grows large, especially in big data applications. Approximate methods for finding similar items trade some accuracy for the ability to deliver effective services in a reasonable amount of time. Locality sensitive hashing is a class of efficient approximate similarity search techniques. The various algorithms proposed for locality sensitive hashing all try to narrow down the candidate data set to be examined. The candidate data set does not always contain all the data similar to the query, and thus the search results are approximate. Enlarging the candidate set improves the recall of similar items, but it slows down processing. This paper is concerned with a method that increases the recall rate without incurring significant cost. The method uses a random hyperplane partitioning technique to create buckets to which data objects are distributed. Nearest neighbors located on the other side of such hyperplanes can become false negatives when only the bucket to which the query belongs is examined. The proposed method extends each hyperplane to cover its vicinity, so that data objects in the vicinity of a hyperplane are treated as belonging to both sides of the hyperplane simultaneously. Over-sized buckets are further split by adding hyperplanes to control the bucket sizes. To improve the processing speed, the algorithm is implemented in the MapReduce paradigm on a Hadoop cluster. Experimental results are presented to show its applicability.
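The idea of treating objects near a hyperplane as belonging to both sides can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the function names, the `margin` parameter (the half-width of the vicinity), and the choice of hyperplanes through the origin are assumptions made for illustration, and both the splitting of over-sized buckets and the MapReduce distribution are left out.

```python
import random
from collections import defaultdict
from itertools import product


def random_hyperplanes(num_planes, dim, seed=0):
    """Draw normal vectors with Gaussian components (hyperplanes through the origin, by assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def query_key(point, planes):
    """Bucket key for a query: one side bit per hyperplane, no replication."""
    return tuple(1 if dot(point, w) >= 0 else 0 for w in planes)


def index_keys(point, planes, margin):
    """All bucket keys under which a stored point is indexed.

    `margin` is a hypothetical parameter: a point whose distance to a
    hyperplane is at most `margin` is placed on both sides of it.
    """
    choices = []
    for w in planes:
        # Signed distance from the point to the hyperplane with normal w.
        d = dot(point, w) / dot(w, w) ** 0.5
        if abs(d) <= margin:
            choices.append((0, 1))  # in the vicinity: both sides
        else:
            choices.append((1,) if d > 0 else (0,))
    # Every combination of allowed side bits is one bucket key.
    return [tuple(bits) for bits in product(*choices)]


def build_index(points, planes, margin):
    """Map each bucket key to the list of indices of the points stored in it."""
    buckets = defaultdict(list)
    for i, p in enumerate(points):
        for key in index_keys(p, planes, margin):
            buckets[key].append(i)
    return buckets
```

With the two axis-aligned planes `[[1.0, 0.0], [0.0, 1.0]]` and `margin=0.1`, the point `[0.05, 2.0]` is stored in both `(0, 1)` and `(1, 1)`, so a query at `[-0.02, 3.0]`, which hashes to `(0, 1)`, still finds it. With `margin=0.0` the point is stored only in `(1, 1)` and the same query misses it. Note that a point lying in the vicinity of k hyperplanes is replicated into 2^k buckets, which is one reason bucket sizes need to be controlled.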


Locality sensitive hashing, Similarity search, Hadoop, MapReduce



This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) (Grant No. 2015R1D1A1A01061062).



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. Department of Computer Science, Chungbuk National University, Cheongju, Korea
  2. Division of Information and Communication Convergence Engineering, Mokwon University, Daejeon, Korea
