Bulk construction of dynamic clustered metric trees

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Repositories of complex data types, such as images, audio, video and free text, are becoming increasingly common in various fields. A general approach to searching such data is similarity search, where the goal is to find similar objects and similarity is modeled by a metric distance function. An important class of access methods for similarity search in metric data is that of dynamic clustered metric trees, in which the index is structured as a paged, balanced tree and the space is partitioned hierarchically into compact regions. While access methods of this class support dynamic insertion, typically of single objects, the problem of efficiently inserting a given data set into the index in bulk is largely open. In this article, we address this problem and propose novel algorithms for its two cases: the index is initially empty (i.e., bulk loading), or the index is initially non-empty (i.e., bulk insertion). The proposed bulk loading algorithm builds the index bottom-up, layer by layer, using a new sampling-based clustering method that improves clustering results by improving the quality of the selected sample sets. The proposed bulk insertion algorithm uses the bulk loading algorithm to load the given data into a new index structure, and then merges the new and existing structures into a unified, high-quality index, using a novel decomposition method to reduce overlaps between the structures. Both algorithms yield significantly improved construction and search performance, and are applicable to all dynamic clustered metric trees. Results from an extensive experimental study show that the proposed algorithms outperform alternative methods, reducing construction costs by up to 47% for CPU costs and 99% for I/O costs, and search costs by up to 48% for CPU costs and 30% for I/O costs.
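To make the idea of bottom-up, layer-by-layer construction concrete, the following is a minimal sketch in Python. It is not the algorithm proposed in the article: it substitutes plain uniform random sampling for the article's sample-selection method, does not enforce page capacity on individual nodes, and keeps everything in memory. The names `Node`, `cluster_by_sample`, `bulk_load`, `range_search` and `euclid` are hypothetical and chosen only for illustration. The sketch shows how a layer of regions (center plus covering radius) can be clustered into the layer above until a single root remains, and how the triangle inequality lets a range query skip whole regions.

```python
import random
from dataclasses import dataclass, field
from typing import Any, Callable, List

Distance = Callable[[Any, Any], float]

@dataclass
class Node:
    center: Any                                            # routing object of the region
    radius: float                                          # covering radius of the region
    children: List["Node"] = field(default_factory=list)   # empty for leaves
    objects: List[Any] = field(default_factory=list)       # filled only for leaves

def cluster_by_sample(items, key, dist, capacity, rng):
    """Assign each item to its nearest sampled center.
    Assumption: centers are a plain uniform random sample, and group
    sizes are NOT capped at `capacity` (a real paged index would cap them)."""
    k = max(1, -(-len(items) // capacity))                 # ceil(len / capacity)
    centers = rng.sample(items, k)
    groups = [[] for _ in centers]
    for item in items:
        nearest = min(range(k), key=lambda j: dist(key(item), key(centers[j])))
        groups[nearest].append(item)
    return [(c, g) for c, g in zip(centers, groups) if g]  # drop empty groups

def bulk_load(objects, dist: Distance, capacity=4, seed=0) -> Node:
    """Build the tree bottom-up: the leaf layer first, then one layer at a time."""
    objects = list(objects)
    if not objects or capacity < 2:
        raise ValueError("need at least one object and capacity >= 2")
    rng = random.Random(seed)
    level = [
        Node(center=c, radius=max(dist(c, o) for o in group), objects=group)
        for c, group in cluster_by_sample(objects, lambda o: o, dist, capacity, rng)
    ]
    while len(level) > 1:
        parents = []
        for c, group in cluster_by_sample(level, lambda n: n.center, dist, capacity, rng):
            # Upper bound on the covering radius via the triangle inequality.
            r = max(dist(c.center, n.center) + n.radius for n in group)
            parents.append(Node(center=c.center, radius=r, children=group))
        level = parents
    return level[0]

def range_search(node: Node, query, radius: float, dist: Distance) -> list:
    """Return all indexed objects within `radius` of `query`."""
    if dist(query, node.center) > radius + node.radius:
        return []                                          # region cannot contain a match
    if not node.children:
        return [o for o in node.objects if dist(query, o) <= radius]
    found = []
    for child in node.children:
        found.extend(range_search(child, query, radius, dist))
    return found

def euclid(a, b) -> float:
    """Euclidean distance, used here only as an example metric."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
```

A possible usage: build the tree with `root = bulk_load(points, euclid, capacity=8)` and query it with `range_search(root, (0.5, 0.5), 0.2, euclid)`. Because every layer is derived from the one below it, all leaves end up at the same depth, which is the sense in which the sketch stays balanced; the quality of the regions, however, depends entirely on how good the sampled centers are, which is the aspect the article's clustering method targets.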



Author information

Corresponding author

Correspondence to Lior Aronovich.


About this article

Cite this article

Aronovich, L., Spiegler, I. Bulk construction of dynamic clustered metric trees. Knowl Inf Syst 22, 211–244 (2010). https://doi.org/10.1007/s10115-009-0195-1


  • DOI: https://doi.org/10.1007/s10115-009-0195-1
