The VLDB Journal, Volume 16, Issue 4, pp 483–505

The Omni-family of all-purpose access methods: a simple and effective way to make similarity search more efficient

  • Caetano Traina Jr.
  • Roberto F. Santos Filho
  • Agma J. M. Traina
  • Marcos R. Vieira
  • Christos Faloutsos
Regular Paper

Abstract

Similarity search operations require executing expensive algorithms, and although broadly useful in many new applications, they rely on specific structures not yet supported by commercial DBMSs. In this paper we discuss the new Omni-technique, which allows building a variety of dynamic Metric Access Methods based on a number of selected objects from the dataset, used as global reference objects. We call them the Omni-family of metric access methods. This technique enables building similarity search operations on top of existing structures, significantly improving their performance with regard to the number of disk accesses and distance calculations. Additionally, our methods scale up well, exhibiting sub-linear behavior with growing database size.
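The abstract does not spell out how global reference objects reduce distance calculations, so the following is only a minimal sketch of one common way to do it: pruning with the triangle inequality. The class name `ReferenceIndex`, the parameter `num_refs`, the use of Euclidean distance, and the choice of the first objects as references are all assumptions made for illustration, not the selection strategy or structures described in the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ReferenceIndex:
    """Hypothetical in-memory index over distances to global reference objects."""

    def __init__(self, objects, num_refs, dist=euclidean):
        self.objects = list(objects)
        self.dist = dist
        # Assumption: simply take the first num_refs objects as references.
        self.refs = self.objects[:num_refs]
        # Precompute the distance from every object to every reference.
        self.coords = [[dist(o, r) for r in self.refs] for o in self.objects]
        self.distance_calls = 0  # distance computations in the last query

    def range_query(self, query, radius):
        """Return all objects within `radius` of `query`."""
        q_coords = [self.dist(query, r) for r in self.refs]
        self.distance_calls = len(self.refs)
        result = []
        for obj, coords in zip(self.objects, self.coords):
            # Triangle inequality: |d(q, r) - d(o, r)| <= d(q, o).
            # If the left side exceeds the radius for any reference,
            # the object cannot qualify and its distance is never computed.
            if any(abs(qc - oc) > radius for qc, oc in zip(q_coords, coords)):
                continue
            self.distance_calls += 1
            if self.dist(query, obj) <= radius:
                result.append(obj)
        return result
```

Calling `ReferenceIndex(points, 3).range_query(q, r)` returns the same objects as a sequential scan, while `distance_calls` reports how many distance computations the filter left to perform. Because the precomputed distances are plain numbers, they could in principle be stored in an existing structure, which is the kind of reuse the abstract alludes to.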

Keywords

Similarity search, Metric access methods, Index structures, Multimedia databases

Copyright information

© Springer-Verlag 2006

Authors and Affiliations

  • Caetano Traina Jr. (1)
  • Roberto F. Santos Filho (1)
  • Agma J. M. Traina (1)
  • Marcos R. Vieira (1)
  • Christos Faloutsos (2)
  1. Department of Computer Science and Statistics, University of São Paulo at São Carlos, São Carlos, Brazil
  2. Department of Computer Science, Carnegie Mellon University, Pittsburgh, USA
