Fast Near Neighbor Search in High-Dimensional Binary Data

  • Anshumali Shrivastava
  • Ping Li
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7523)


Numerous applications in search, databases, machine learning, and computer vision can benefit from efficient algorithms for near neighbor search. This paper proposes a simple framework for fast near neighbor search in high-dimensional binary data, which are common in practice (e.g., text). We develop a simple and effective strategy for sub-linear time near neighbor search by creating hash tables directly from the bits generated by b-bit minwise hashing. The advantages of our method are demonstrated through thorough comparisons with two strong baselines: spectral hashing and sign (1-bit) random projections.
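
The idea of indexing hash tables by b-bit minwise hashing bits can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions not stated in the abstract: each binary vector is represented as a non-empty set of nonzero feature indices, random permutations are simulated with linear hashes modulo a prime, each table key concatenates the lowest b bits of k minwise hashes, and retrieved candidates are re-ranked by exact resemblance. The names `BBitMinHashIndex`, `resemblance`, `num_tables`, and `PRIME` are hypothetical.

```python
import random
from collections import defaultdict

# Assumption: a Mersenne prime (2^31 - 1) larger than any feature index.
PRIME = 2_147_483_647


def resemblance(a, b):
    """Resemblance (Jaccard similarity) of two non-empty sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


class BBitMinHashIndex:
    """Illustrative index: num_tables hash tables, each keyed by b*k bits."""

    def __init__(self, b=2, k=4, num_tables=8, seed=0):
        rng = random.Random(seed)
        self.b, self.k = b, k
        # One (a, c) pair per simulated permutation, k pairs per table.
        self.params = [
            [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]
            for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.data = {}

    def _key(self, items, table_params):
        # Concatenate the lowest b bits of each of the k minwise hashes.
        key = 0
        mask = (1 << self.b) - 1
        for a, c in table_params:
            min_hash = min((a * x + c) % PRIME for x in items)
            key = (key << self.b) | (min_hash & mask)
        return key

    def add(self, point_id, items):
        # Store the point and place its id in one bucket per table.
        self.data[point_id] = frozenset(items)
        for table, params in zip(self.tables, self.params):
            table[self._key(items, params)].append(point_id)

    def query(self, items, top=5):
        # Collect candidates that share a bucket with the query in any table,
        # then re-rank them by exact resemblance.
        query_set = frozenset(items)
        candidates = set()
        for table, params in zip(self.tables, self.params):
            candidates.update(table.get(self._key(query_set, params), ()))
        scored = [(resemblance(query_set, self.data[c]), c) for c in candidates]
        scored.sort(key=lambda t: (-t[0], str(t[1])))
        return [(c, s) for s, c in scored[:top]]


if __name__ == "__main__":
    index = BBitMinHashIndex(b=2, k=4, num_tables=8, seed=1)
    index.add("doc_a", {1, 2, 3, 4, 5})
    index.add("doc_b", {100, 200, 300})
    print(index.query({1, 2, 3, 4, 5}))
```

In this sketch a query only touches the buckets its own keys select, so the number of exact comparisons depends on bucket occupancy rather than on the full data set size. Larger values of b and k give more selective buckets, while more tables raise the chance that a true near neighbor shares at least one bucket with the query.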


Keywords: Neighbor Search, Hash Table, Collision Probability, Query Point, Random Projection



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Anshumali Shrivastava (1)
  • Ping Li (1)

  1. Cornell University, Ithaca, USA
