Persistent clustered main memory index for accelerating k-NN queries on high dimensional datasets

Multimedia Tools and Applications

Abstract

Similarity search implemented via k-nearest neighbor (k-NN) queries on multidimensional indices is an extremely useful paradigm for content-based image retrieval. As the dimensionality of feature vectors increases, the curse of dimensionality sets in: the performance of k-NN search on disk-resident indices in the R-tree family degrades rapidly because index pages increasingly overlap in high dimensions. This study addresses the problem by exploiting the double filtering effect of clustering and indexing. The clustering algorithm ensures that the largest cluster fits into main memory and that only the clusters closest to a query point need to be searched, and hence loaded into main memory. We organize the data in each cluster as an ordered-partition (OP) tree, a main-memory-resident index that is not prone to the curse of dimensionality and is highly efficient for processing k-NN queries. We serialize an OP-tree by writing its dynamically allocated nodes into contiguous memory locations, optimize its parameters, and make it persistent by writing it to disk. Reading or writing the clusters constituting an OP-tree takes a single sequential disk access, so the transfer time benefits from the higher data transfer rates of modern disk drives. The performance of the index is further improved by applying the Karhunen–Loève transformation (KLT) to the dataset, since this results in more efficient computation of distances for k-NN queries. We compare OP-trees and sequential scans with and without the KLT, and with and without a shortcut method for calculating Euclidean distances. A comparison against the OMNI-sequential scan is also reported. Finally, we compare a clustered and persistent version of the OP-tree against a clustered version of the SR-tree and the VA-file method. CPU time is measured and elapsed time is estimated in this study. We observe that the OP-tree index outperforms the other two methods and that the improvement increases with the number of dimensions.
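The abstract does not spell out the shortcut method for calculating Euclidean distances. One common reading, assumed here, is partial-distance early termination: the squared distance is accumulated one dimension at a time and abandoned as soon as it exceeds the distance to the current k-th nearest neighbor. The sketch below illustrates that idea in a plain sequential scan; the function names `partial_sq_dist` and `knn_scan` are hypothetical and are not taken from the paper.

```python
import heapq


def partial_sq_dist(p, q, bound):
    """Squared Euclidean distance between p and q, or None once it exceeds bound."""
    total = 0.0
    for a, b in zip(p, q):
        diff = a - b
        total += diff * diff
        if total > bound:
            return None  # p cannot displace any of the current k nearest
    return total


def knn_scan(points, query, k):
    """Sequential k-NN scan; returns (squared distance, index) pairs, nearest first."""
    heap = []  # max-heap via negated distances, holding the k best candidates so far
    for idx, p in enumerate(points):
        # Until k candidates are known there is no useful bound.
        bound = -heap[0][0] if len(heap) == k else float("inf")
        d = partial_sq_dist(p, query, bound)
        if d is None:
            continue  # abandoned early, skip this point
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx))
        else:
            heapq.heapreplace(heap, (-d, idx))  # evict the current k-th nearest
    return sorted((-neg, idx) for neg, idx in heap)
```

If the coordinates have first been rotated by the KLT, the leading dimensions carry the largest variance, so the running sum would tend to cross the bound after fewer terms. That is one plausible way the transformation could make distance computation cheaper, stated here as an assumption rather than as the paper's analysis.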

Author information

Corresponding author

Correspondence to Alexander Thomasian.

Additional information

We acknowledge the support for A. Thomasian of NSF through Grant 0105485 in Computer Systems Architecture.

About this article

Cite this article

Thomasian, A., Zhang, L. Persistent clustered main memory index for accelerating k-NN queries on high dimensional datasets. Multimed Tools Appl 38, 253–270 (2008). https://doi.org/10.1007/s11042-007-0179-7

  • DOI: https://doi.org/10.1007/s11042-007-0179-7
