Abstract
Similarity search implemented via k-nearest neighbor (k-NN) queries on multidimensional indices is an extremely useful paradigm for content-based image retrieval. As the dimensionality of feature vectors increases, the curse of dimensionality sets in: the performance of k-NN search on disk-resident indices in the R-tree family degrades rapidly because of the overlap among index pages in high dimensions. This study addresses the problem by exploiting the double filtering effect of clustering and indexing. The clustering algorithm ensures that the largest cluster fits into main memory and that only the clusters closest to a query point need to be searched and hence loaded into main memory. We organize the data in each cluster according to the ordered-partition (OP) tree, a main-memory-resident index that is not prone to the curse of dimensionality and is highly efficient for processing k-NN queries. We serialize an OP-tree by writing its dynamically allocated nodes into contiguous memory locations, optimize its parameters, and make it persistent by writing it to disk. Reading or writing the clusters that constitute an OP-tree requires a single sequential disk access, which benefits from the higher data transfer rates of modern disk drives. The performance of the index is further improved by applying the Karhunen–Loève transformation (KLT) to the dataset, since this results in a more efficient computation of distances for k-NN queries. We compare OP-trees and sequential scans with and without the KLT, and with and without a shortcut method for calculating Euclidean distances. A comparison against the OMNI-sequential scan is also reported. Finally, we compare a clustered and persistent version of the OP-tree against a clustered version of the SR-tree and against the VA-file method. CPU time is measured and elapsed time is estimated in this study. The OP-tree index outperforms the other two methods, and the improvement increases with the number of dimensions.
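As an illustration of the sequential-scan baseline with a distance shortcut, the following is a minimal sketch, not the authors' implementation. It assumes that the "shortcut method" means abandoning a squared Euclidean distance computation as soon as the partial sum exceeds the current k-th best distance; the abstract does not spell out the method, so this reading is an assumption. The function name `knn_scan` and its parameters are hypothetical.

```python
import heapq


def knn_scan(points, query, k):
    """Sequential k-NN scan using early-abandon squared Euclidean distances.

    Assumption: the 'shortcut' is partial-distance early termination.
    Returns a list of (squared_distance, index) pairs sorted by distance.
    """
    if k <= 0:
        return []
    # Max-heap of the best k candidates so far, stored as (-squared_distance, index).
    heap = []
    for idx, point in enumerate(points):
        # Pruning bound: the k-th smallest squared distance found so far.
        bound = -heap[0][0] if len(heap) == k else float("inf")
        acc = 0.0
        for a, b in zip(point, query):
            diff = a - b
            acc += diff * diff
            if acc > bound:
                break  # partial sum already exceeds the bound: skip this point
        else:
            # Loop ran to completion, so acc is the full squared distance.
            if len(heap) == k:
                heapq.heapreplace(heap, (-acc, idx))
            else:
                heapq.heappush(heap, (-acc, idx))
    return sorted((-neg_dist, idx) for neg_dist, idx in heap)


if __name__ == "__main__":
    data = [(0, 0), (3, 4), (1, 1), (6, 8)]
    print(knn_scan(data, (0, 0), 2))
```

Under this reading, the benefit of the KLT would come from the transformed coordinates being ordered by decreasing variance, so that the leading terms of the sum tend to contribute most of the distance and the early exit triggers sooner. Squared distances are compared throughout, which avoids the square root without changing the ranking.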
Additional information
We acknowledge the support of NSF for A. Thomasian through Grant 0105485 in Computer Systems Architecture.
Cite this article
Thomasian, A., Zhang, L. Persistent clustered main memory index for accelerating k-NN queries on high dimensional datasets. Multimed Tools Appl 38, 253–270 (2008). https://doi.org/10.1007/s11042-007-0179-7
DOI: https://doi.org/10.1007/s11042-007-0179-7