Bulk construction of dynamic clustered metric trees

Aronovich, Lior; Spiegler, Israel

doi:10.1007/s10115-009-0195-1

Regular Paper
Published: 26 March 2009

Knowledge and Information Systems, Volume 22, pages 211–244 (2010)

Abstract

Repositories of complex data types, such as images, audio, video and free text, are becoming increasingly common in various fields. A general approach to searching such data is similarity search, where the search is for similar objects and similarity is modeled by a metric distance function. An important class of access methods for similarity search in metric data is that of dynamic clustered metric trees, where the index is structured as a paged and balanced tree and the space is partitioned hierarchically into compact regions. While access methods of this class allow dynamic insertions, typically of single objects, the problem of efficiently inserting a given data set into the index in bulk is largely open. In this article we address this problem and propose novel algorithms for its two cases: where the index is initially empty (i.e., bulk loading), and where the index is initially non-empty (i.e., bulk insertion). The proposed bulk loading algorithm builds the index bottom-up, layer by layer, using a new sampling-based clustering method that improves clustering results by improving the quality of the selected sample sets. The proposed bulk insertion algorithm employs the bulk loading algorithm to load the given data into a new index structure, and then merges the new and the existing structures into a unified high-quality index, using a novel decomposition method to reduce overlaps between the structures. Both algorithms yield significantly improved construction and search performance, and are applicable to all dynamic clustered metric trees. Results from an extensive experimental study show that the proposed algorithms outperform alternative methods, reducing construction costs by up to 47% (CPU) and 99% (I/O), and search costs by up to 48% (CPU) and 30% (I/O).
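
As a rough illustration of the bottom-up, layer-by-layer idea, the following Python sketch builds a small in-memory metric tree and answers a range query on it. It is not the algorithm proposed in the article: it uses plain uniform random sampling to pick cluster centers rather than the improved sample selection the abstract describes, it does not enforce a page capacity on the resulting nodes, and it does no I/O. All names (`euclidean`, `build_layer`, `bulk_load`, `range_search`) and the `(center, covering_radius, children)` node layout are assumptions made for this example only.

```python
import math
import random


def euclidean(a, b):
    """Example metric; any function satisfying the metric axioms would do."""
    return math.dist(a, b)


def build_layer(items, capacity, dist, rng):
    """Cluster one layer of entries into parent nodes.

    Each entry is a (center, covering_radius, children) triple.
    Centers are chosen by plain random sampling (an assumption of this sketch).
    """
    k = max(1, math.ceil(len(items) / capacity))
    seeds = rng.sample(items, k)
    groups = [[] for _ in seeds]
    for item in items:
        # Assign each entry to its nearest sampled center.
        nearest = min(range(k), key=lambda j: dist(item[0], seeds[j][0]))
        groups[nearest].append(item)
    nodes = []
    for seed, group in zip(seeds, groups):
        if not group:
            continue  # possible only when sampled centers coincide
        # The covering radius must enclose every child region.
        radius = max(dist(seed[0], c) + r for c, r, _ in group)
        nodes.append((seed[0], radius, group))
    return nodes


def bulk_load(points, capacity, dist, rng):
    """Build the tree bottom-up, one layer at a time, until one root remains."""
    if not points or capacity < 2:
        raise ValueError("need at least one point and capacity >= 2")
    items = [(p, 0.0, None) for p in points]  # data objects: radius 0, no children
    while True:
        items = build_layer(items, capacity, dist, rng)
        if len(items) == 1:
            return items[0]


def range_search(node, query, r, dist):
    """Return all stored objects within distance r of query."""
    center, cover, children = node
    if dist(query, center) > r + cover:
        return []  # triangle inequality: nothing in this region can qualify
    if children is None:
        return [center]
    hits = []
    for child in children:
        hits.extend(range_search(child, query, r, dist))
    return hits


if __name__ == "__main__":
    rng = random.Random(0)
    pts = [(rng.random(), rng.random()) for _ in range(1000)]
    root = bulk_load(pts, capacity=16, dist=euclidean, rng=rng)
    print(len(range_search(root, (0.5, 0.5), 0.1, euclidean)))
```

Because every layer wraps all entries of the layer below it, all data objects end up at the same depth, which mirrors the balanced structure mentioned in the abstract. Cluster sizes, however, depend entirely on the random sample here, which is exactly the weakness a better sample selection method would aim to address.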

Author information

Authors and Affiliations

Information Systems Department, Tel Aviv University, Tel Aviv, Israel
Lior Aronovich & Israel Spiegler

Corresponding author

Correspondence to Lior Aronovich.

About this article

Cite this article

Aronovich, L., Spiegler, I. Bulk construction of dynamic clustered metric trees. Knowl Inf Syst 22, 211–244 (2010). https://doi.org/10.1007/s10115-009-0195-1

Received: 18 November 2007
Revised: 18 December 2008
Accepted: 11 January 2009
Published: 26 March 2009
Issue Date: February 2010
DOI: https://doi.org/10.1007/s10115-009-0195-1
