Abstract
Similarity search is important in information retrieval applications, where objects are usually represented as high-dimensional vectors. This creates a growing need for indexing high-dimensional data. However, index structures based on space partitioning become ineffective because of the well-known “curse of dimensionality”, and a linear scan of the data with approximation is more efficient for high-dimensional similarity search. Approaches so far have concentrated on reducing I/O and have ignored the computation cost. For an expensive distance function, such as the L_p norm with fractional p, the computation cost becomes the bottleneck. We propose a new technique that addresses expensive distance functions by “indexing the function”: some key values of the function are pre-computed once, and these values are then used to derive upper and lower bounds on the distance between a data vector and the query vector. The technique is extremely efficient since it avoids most of the distance function computations; moreover, it requires no extra secondary storage because no index is constructed and stored. The efficiency is confirmed by cost analysis, as well as experiments on synthetic and real data.
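The idea of bounding an expensive distance from a few pre-computed function values can be illustrated with a minimal sketch. It is not the authors' algorithm, whose details the abstract does not give. It assumes that vector components are normalized to [0, 1], that the expensive function is t^p with fractional p, and that the key values lie on an evenly spaced grid. All names (`build_table`, `bounds`, `exact`, `knn`, `cells`, `max_diff`) are hypothetical. Because t^p is monotonically increasing, the table values at the two grid points enclosing each per-dimension difference bound the true value from below and above, so most vectors can be discarded without ever calling the power function.

```python
import bisect


def build_table(p, max_diff=1.0, cells=64):
    """Pre-compute t**p once at evenly spaced grid points on [0, max_diff].

    The even grid and the parameter names are illustrative assumptions.
    """
    step = max_diff / cells
    grid = [i * step for i in range(cells + 1)]
    values = [g ** p for g in grid]
    return grid, values


def bounds(query, vector, table):
    """Return (lower, upper) bounds on sum(|q_i - v_i| ** p) via table lookups."""
    grid, values = table
    last = len(grid) - 1
    lo = hi = 0.0
    for q, v in zip(query, vector):
        d = abs(q - v)
        if d > grid[last]:
            raise ValueError("difference exceeds the tabulated range")
        # Largest grid index i with grid[i] <= d, clamped so that i + 1 exists.
        i = min(bisect.bisect_right(grid, d) - 1, last - 1)
        lo += values[i]       # grid[i] <= d, so grid[i]**p <= d**p
        hi += values[i + 1]   # d <= grid[i+1], so d**p <= grid[i+1]**p
    return lo, hi


def exact(query, vector, p):
    """Expensive computation: p-th power of the L_p distance."""
    return sum(abs(q - v) ** p for q, v in zip(query, vector))


def knn(query, data, k, p, table):
    """Filter-and-refine k-nearest-neighbor search (requires k <= len(data)).

    Returns the indices of the k nearest vectors and the number of
    vectors for which the exact distance had to be computed.
    """
    bnds = [bounds(query, v, table) for v in data]
    # The k-th smallest upper bound is at least the k-th smallest exact distance.
    kth_upper = sorted(hi for _, hi in bnds)[k - 1]
    # A vector whose lower bound exceeds that value cannot be among the k nearest.
    candidates = [i for i, (lo, _) in enumerate(bnds) if lo <= kth_upper]
    scored = sorted((exact(query, data[i], p), i) for i in candidates)
    return [i for _, i in scored[:k]], len(candidates)
```

Ranking by the sum of p-th powers gives the same order as ranking by the L_p distance itself, since the final root is monotone. A call such as `knn(query, data, 5, 0.5, build_table(0.5))` returns the five nearest indices together with the candidate count; a finer grid (larger `cells`) tightens the bounds and shrinks the candidate set, at the cost of a larger in-memory table.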
Cite this article
Chen, H., Liu, J., Furuse, K. et al. Indexing expensive functions for efficient multi-dimensional similarity search. Knowl Inf Syst 27, 165–192 (2011). https://doi.org/10.1007/s10115-010-0303-2
Keywords
- Similarity search
- High-dimensional space
- Function index