Efficient Approximate Similarity Search Using Random Projection Learning

  • Peisen Yuan
  • Chaofeng Sha
  • Xiaoling Wang
  • Bin Yang
  • Aoying Zhou
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6897)

Abstract

Efficient similarity search on high-dimensional data is an important research topic in the database and information retrieval fields. In this paper, we propose a random projection learning approach to the approximate similarity search problem. First, the random projection technique of locality-sensitive hashing is applied to generate high-quality binary codes. Then the binary codes are treated as labels, and a group of SVM classifiers is trained on the labeled data to predict the binary codes of similarity queries. Experiments on real datasets demonstrate that our method substantially outperforms existing work in terms of preprocessing time and query processing.
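
The two steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the SVM variant, the code length, or how candidates are ranked, so the sketch assumes sign random projections for the binary codes, one linear SVM per bit trained with a simple Pegasos-style stochastic subgradient method, and Hamming distance for ranking. All function names and parameter values are hypothetical.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes (assumed projection scheme)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def binary_code(vec, hyperplanes):
    """Bit j is 1 when vec lies on the non-negative side of hyperplane j."""
    return [1 if dot(h, vec) >= 0 else 0 for h in hyperplanes]


def hamming(a, b):
    """Number of positions where two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


def train_linear_svm(X, y, lam=0.01, epochs=20, seed=0):
    """Pegasos-style linear SVM without bias; labels y must be -1 or +1.

    This solver is an assumption made for a dependency-free sketch.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * dot(w, X[i])
            # Shrink step from the L2 regularizer.
            w = [(1.0 - eta * lam) * wj for wj in w]
            # Hinge-loss step only when the margin is violated.
            if margin < 1:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def train_bit_classifiers(X, codes):
    """Train one classifier per bit, using that bit of each code as the label."""
    n_bits = len(codes[0])
    return [
        train_linear_svm(X, [1 if c[j] else -1 for c in codes])
        for j in range(n_bits)
    ]


def predict_code(vec, classifiers):
    """Predict a query's binary code, one bit per trained classifier."""
    return [1 if dot(w, vec) >= 0 else 0 for w in classifiers]


if __name__ == "__main__":
    rng = random.Random(1)
    data = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]
    planes = make_hyperplanes(dim=5, n_bits=8)
    codes = [binary_code(x, planes) for x in data]
    classifiers = train_bit_classifiers(data, codes)

    query = [rng.gauss(0.0, 1.0) for _ in range(5)]
    q_code = predict_code(query, classifiers)
    nearest = min(range(len(data)), key=lambda i: hamming(q_code, codes[i]))
    print("predicted query code:", q_code)
    print("closest item by Hamming distance:", nearest)
```

Treating each bit as a separate binary classification problem keeps training simple, and ranking by Hamming distance makes comparing a query against stored codes cheap. A practical system would likely use an established SVM library rather than this hand-written solver.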

Keywords

Binary Vector; Query Processing; Similarity Search; Binary Code; Cosine Similarity
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Peisen Yuan (1)
  • Chaofeng Sha (1)
  • Xiaoling Wang (2)
  • Bin Yang (1)
  • Aoying Zhou (2)
  1. School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai, P.R. China
  2. Shanghai Key Laboratory of Trustworthy Computing, Software Engineering Institute, East China Normal University, Shanghai, P.R. China