Efficient Incremental Near Duplicate Detection Based on Locality Sensitive Hashing

Fisichella, Marco; Deng, Fan; Nejdl, Wolfgang

doi:10.1007/978-3-642-15364-8_11

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6261)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

In this paper, we study the problem of detecting near duplicates among high-dimensional data points in an incremental manner. For example, on an image-sharing website it is desirable to detect near duplicates whenever a user uploads a new image, so that the user can take action such as stopping the upload or reporting an illegal copy. Specifically, whenever a new point arrives, our goal is to find all points in the existing point set that are close to the new point, according to a given distance function and distance threshold, before the new point is inserted into the data set. Building on a well-known indexing technique, Locality Sensitive Hashing (LSH), we propose a new approach that clearly reduces the running time of LSH indexing while using only a small amount of extra space. The idea is to store a small fraction of the near-duplicate pairs within the existing point set, which are found when the points are inserted into the data set, and to use them to prune the LSH candidate sets for the newly arrived point. Extensive experiments on three real-world data sets show that our method consistently outperforms the original LSH approach: to reach the same query response time, our method needs significantly less memory. Meanwhile, our approach preserves the LSH theoretical guarantee on the quality of the search result. Furthermore, our approach is easy to implement on top of LSH.
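
The query-before-insert workflow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `IncrementalLSH`, its parameters, and the use of p-stable (Gaussian) hash functions for Euclidean distance are assumptions made for illustration. The pruning rule shown (skip a candidate when the triangle inequality already proves it lies beyond the threshold) is likewise only one plausible way to exploit stored near-duplicate pairs, and the sketch stores every pair found rather than a small fraction.

```python
import math
import random


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class IncrementalLSH:
    """Toy LSH index (hypothetical names) for incremental near-duplicate detection."""

    def __init__(self, dim, threshold, num_tables=8, hashes_per_table=4,
                 width=4.0, seed=0):
        rng = random.Random(seed)
        self.threshold = threshold
        self.width = width
        # Each table: a list of (Gaussian vector, uniform offset) hash
        # functions plus a dict mapping hash keys to lists of point ids.
        self.tables = []
        for _ in range(num_tables):
            funcs = [([rng.gauss(0.0, 1.0) for _ in range(dim)],
                      rng.uniform(0.0, width))
                     for _ in range(hashes_per_table)]
            self.tables.append((funcs, {}))
        self.points = []
        self.dup_pairs = {}  # point id -> {near-duplicate id: distance}
        self.distance_computations = 0

    def _key(self, funcs, p):
        # Concatenate the quantized projections into one bucket key.
        return tuple(
            math.floor((sum(ai * xi for ai, xi in zip(a, p)) + b) / self.width)
            for a, b in funcs)

    def query_then_insert(self, p):
        """Return ids of stored points within the threshold of p, then insert p."""
        keys = [self._key(funcs, p) for funcs, _ in self.tables]
        candidates = set()
        for (_, buckets), key in zip(self.tables, keys):
            candidates.update(buckets.get(key, ()))

        duplicates, pruned = {}, set()
        for c in sorted(candidates):
            if c in pruned:
                continue  # already known to be farther than the threshold
            d = euclidean(p, self.points[c])
            self.distance_computations += 1
            if d <= self.threshold:
                duplicates[c] = d
            # Assumed pruning rule: d(p, q) >= d(p, c) - d(c, q), so q can be
            # skipped when that lower bound already exceeds the threshold.
            for q, d_cq in self.dup_pairs.get(c, {}).items():
                if d - d_cq > self.threshold:
                    pruned.add(q)

        new_id = len(self.points)
        self.points.append(list(p))
        for (_, buckets), key in zip(self.tables, keys):
            buckets.setdefault(key, []).append(new_id)
        for c, d in duplicates.items():
            self.dup_pairs.setdefault(new_id, {})[c] = d
            self.dup_pairs.setdefault(c, {})[new_id] = d
        return sorted(duplicates)
```

Because every reported id is verified with an exact distance computation, and the pruning step only skips candidates that provably lie beyond the threshold, the sketch returns the same result set as the unpruned LSH lookup would for the same hash functions. As with any LSH scheme, a true near duplicate can still be missed if it shares no bucket with the query.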

The complete version of this paper can be found at https://www.l3s.de/web/upload/documents/1/SimSearch_complete.pdf

Author information

Authors and Affiliations

Forschungszentrum L3S, Hannover, 30167, Germany
Marco Fisichella, Fan Deng & Wolfgang Nejdl

Editor information

Editors and Affiliations

DeustoTech Computing, University of Deusto, Avda. Universidades, 24, 48007, Bilbao, Spain
Pablo García Bringas
Institut de Recherche en Informatique de Toulouse (IRIT), Paul Sabatier University, 118, route de Narbonne, 31062, Toulouse Cedex, France
Abdelkader Hameurlain
Faculty of Computer Science, Department of Distributed Systems and Multimedia Systems, University of Vienna, Liebiggasse 4/3-4, 1010, Vienna, Austria
Gerald Quirchmayr

About this paper

Cite this paper

Fisichella, M., Deng, F., Nejdl, W. (2010). Efficient Incremental Near Duplicate Detection Based on Locality Sensitive Hashing. In: Bringas, P.G., Hameurlain, A., Quirchmayr, G. (eds) Database and Expert Systems Applications. DEXA 2010. Lecture Notes in Computer Science, vol 6261. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15364-8_11

DOI: https://doi.org/10.1007/978-3-642-15364-8_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15363-1
Online ISBN: 978-3-642-15364-8
eBook Packages: Computer Science (R0)
