Persistent clustered main memory index for accelerating k-NN queries on high dimensional datasets

Thomasian, Alexander; Zhang, Lijuan

doi:10.1007/s11042-007-0179-7

Published: 02 November 2007

Multimedia Tools and Applications, Volume 38, pages 253–270 (2008)

Alexander Thomasian^1,2 &
Lijuan Zhang^1,3

Abstract

Similarity search implemented via k-nearest neighbor (k-NN) queries on multidimensional indices is an extremely useful paradigm for content-based image retrieval. As the dimensionality of feature vectors increases, the curse of dimensionality sets in: the performance of k-NN search on disk-resident indices in the R-tree family degrades rapidly because index pages overlap heavily in high dimensions. This study addresses the problem by exploiting the double filtering effect of clustering and indexing. The clustering algorithm ensures that the largest cluster fits into main memory and that only the clusters closest to a query point need to be searched, and hence loaded into main memory. We organize the data in each cluster as an ordered-partition (OP) tree, a main-memory-resident index that is not prone to the curse of dimensionality and is highly efficient for processing k-NN queries. We serialize an OP-tree by writing its dynamically allocated nodes into contiguous memory locations, optimize its parameters, and make it persistent by writing it to disk. The time to read or write the clusters constituting an OP-tree with a single sequential disk access benefits from the higher data transfer rates of modern disk drives. The performance of the index is further improved by applying the Karhunen–Loève transformation (KLT) to the dataset, since this makes the computation of distances for k-NN queries more efficient. We compare OP-trees and sequential scans with and without the KLT, and with and without a shortcut method for calculating Euclidean distances. A comparison against the OMNI-sequential scan is also reported. Finally, we compare a clustered, persistent version of the OP-tree against a clustered version of the SR-tree and against the VA-file method. CPU time is measured and elapsed time is estimated in this study. The OP-tree index outperforms the other two methods, and the improvement increases with the number of dimensions.
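
The abstract does not spell out the shortcut method for Euclidean distances. A minimal sketch follows, assuming the shortcut is early abandoning: the running sum of squared differences is cut off as soon as it reaches the distance of the current k-th best candidate. The function name `knn_scan_shortcut` and the plain sequential-scan setting are illustrative assumptions, not the authors' implementation. One plausible reason the KLT helps such a scheme is that it orders coordinates by decreasing variance, so the leading dimensions tend to contribute most of the distance and the cut-off can fire earlier.

```python
import heapq


def knn_scan_shortcut(points, query, k):
    """Sequential-scan k-NN on squared Euclidean distance with early abandoning.

    Hypothetical helper for illustration. Returns a list of
    (squared_distance, index) pairs sorted by increasing distance.
    """
    if k <= 0:
        return []
    # Max-heap of the best candidates so far, stored as (-dist2, index),
    # so heap[0] always holds the current k-th (worst) kept candidate.
    heap = []
    for idx, p in enumerate(points):
        # No pruning bound exists until k candidates have been collected.
        bound = -heap[0][0] if len(heap) == k else float("inf")
        dist2 = 0.0
        abandoned = False
        for a, b in zip(p, query):
            diff = a - b
            dist2 += diff * diff
            # Shortcut: the partial sum can only grow, so once it reaches
            # the bound this point cannot improve the result set.
            if dist2 >= bound:
                abandoned = True
                break
        if abandoned:
            continue
        if len(heap) < k:
            heapq.heappush(heap, (-dist2, idx))
        else:
            heapq.heapreplace(heap, (-dist2, idx))
    return sorted((-neg, idx) for neg, idx in heap)
```

Called as `knn_scan_shortcut(points, query, k)` with `points` a sequence of equal-length coordinate tuples, the function returns exactly the neighbors a full distance computation would find; the shortcut only skips arithmetic that cannot change the answer. Working on squared distances also avoids a square root per point, which preserves the ranking.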

Author information

Authors and Affiliations

Computer Science Department, New Jersey Institute of Technology, Newark, NJ, 07102, USA
Alexander Thomasian & Lijuan Zhang
Thomasian and Associates, 17 Meadowbrook Road, Pleasantville, NY, 10570, USA
Alexander Thomasian
AMICAS Inc., Boston, MA, USA
Lijuan Zhang

Corresponding author

Correspondence to Alexander Thomasian.

Additional information

We acknowledge NSF support for A. Thomasian through Grant 0105485 in Computer Systems Architecture.

About this article

Cite this article

Thomasian, A., Zhang, L. Persistent clustered main memory index for accelerating k-NN queries on high dimensional datasets. Multimed Tools Appl 38, 253–270 (2008). https://doi.org/10.1007/s11042-007-0179-7

Published: 02 November 2007
Issue Date: June 2008
DOI: https://doi.org/10.1007/s11042-007-0179-7
