Abstract
Repositories of complex data types, such as images, audio, video and free text, are becoming increasingly common in various fields. A general approach to searching such data is similarity search, where the search is for objects similar to a given query object and similarity is modeled by a metric distance function. An important class of access methods for similarity search in metric data is that of dynamic clustered metric trees, where the index is structured as a paged and balanced tree and the space is partitioned hierarchically into compact regions. While access methods of this class support dynamic insertions, typically of single objects, the problem of efficiently inserting a given data set into the index in bulk is largely open. In this article we address this problem and propose novel algorithms for its two cases: where the index is initially empty (bulk loading), and where the index is initially non-empty (bulk insertion). The proposed bulk loading algorithm builds the index bottom-up, layer by layer, using a new sampling-based clustering method that improves clustering results by improving the quality of the selected sample sets. The proposed bulk insertion algorithm employs the bulk loading algorithm to load the given data into a new index structure, and then merges the new and the existing structures into a unified, high-quality index, using a novel decomposition method to reduce overlaps between the structures. Both algorithms yield significantly improved construction and search performance, and are applicable to all dynamic clustered metric trees. Results from an extensive experimental study show that the proposed algorithms outperform alternative methods, reducing construction costs by up to 47% in CPU cost and 99% in I/O cost, and search costs by up to 48% in CPU cost and 30% in I/O cost.
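To make the idea of bottom-up, layer-by-layer construction concrete, here is a minimal Python sketch. It is not the algorithm proposed in the article, whose details the abstract does not give: centers are chosen by plain uniform random sampling rather than by the article's improved sample selection, and the names `Node`, `cluster`, `bulk_load` and `range_search` are hypothetical. It only assumes a metric distance function `dist` and a page capacity of at least 2.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:
    center: Any          # routing object representing this region
    radius: float        # covering radius: bounds the distance from center to anything below
    children: list = field(default_factory=list)  # objects (leaf) or Nodes (internal)
    is_leaf: bool = True


def cluster(items, rep, capacity, dist, rng):
    """Group items around randomly sampled centers, at most `capacity` per group.

    `rep(item)` returns the object used for distance computations.
    Assumption: uniform sampling stands in for a smarter sample selection.
    """
    k = math.ceil(len(items) / capacity)
    center_idx = rng.sample(range(len(items)), k)
    chosen = set(center_idx)
    groups = [[items[i]] for i in center_idx]   # each group starts with its center
    for i, item in enumerate(items):
        if i in chosen:
            continue
        # k * capacity >= len(items), so at least one group always has room.
        open_groups = [g for g in groups if len(g) < capacity]
        best = min(open_groups, key=lambda g: dist(rep(g[0]), rep(item)))
        best.append(item)
    return groups


def bulk_load(objects, dist, capacity=4, seed=0):
    """Build a balanced tree bottom-up: cluster objects into leaves, then leaves into parents."""
    if not objects:
        raise ValueError("cannot bulk load an empty data set")
    if capacity < 2:
        raise ValueError("capacity must be at least 2")
    rng = random.Random(seed)
    layer = []
    for group in cluster(list(objects), lambda o: o, capacity, dist, rng):
        c = group[0]
        layer.append(Node(c, max(dist(c, o) for o in group), group, True))
    while len(layer) > 1:
        next_layer = []
        for group in cluster(layer, lambda n: n.center, capacity, dist, rng):
            c = group[0].center
            # Triangle inequality: d(c, x) <= d(c, child.center) + child.radius.
            r = max(dist(c, n.center) + n.radius for n in group)
            next_layer.append(Node(c, r, group, False))
        layer = next_layer
    return layer[0]


def range_search(node, query, r, dist):
    """Return all objects within distance r of query, pruning regions that cannot qualify."""
    if dist(query, node.center) > r + node.radius:
        return []
    if node.is_leaf:
        return [o for o in node.children if dist(query, o) <= r]
    found = []
    for child in node.children:
        found.extend(range_search(child, query, r, dist))
    return found


if __name__ == "__main__":
    points = list(range(0, 100, 3))
    tree = bulk_load(points, lambda a, b: abs(a - b), capacity=4, seed=1)
    print(sorted(range_search(tree, 50, 10, lambda a, b: abs(a - b))))
```

Because every layer is produced by clustering the layer below it, all leaves end up at the same depth, which is the balanced, paged shape the abstract describes. The quality of the sampled centers determines how compact the regions are and therefore how much a search can prune, which is where a better sample selection would be expected to pay off.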
Cite this article
Aronovich, L., Spiegler, I. Bulk construction of dynamic clustered metric trees. Knowl Inf Syst 22, 211–244 (2010). https://doi.org/10.1007/s10115-009-0195-1
DOI: https://doi.org/10.1007/s10115-009-0195-1