Efficient Bulk Loading of Large High-Dimensional Indexes

  • Christian Böhm
  • Hans-Peter Kriegel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1676)


Efficient index construction in multidimensional data spaces is important for many knowledge discovery algorithms, because construction times typically must be amortized by performance gains in query processing. In this paper, we propose a generic bulk loading method that allows user-defined split strategies to be applied during index construction. This makes it possible to adapt the index properties to the requirements of a specific knowledge discovery algorithm. Because large data sets do not fit in main memory, the algorithm is based on external sorting. The split strategy can base its decisions on a sample of the data set that is selected automatically. The sort algorithm is a variant of the well-known Quicksort algorithm, enhanced to work on secondary storage. The index construction has a runtime complexity of O(n log n). We show both analytically and experimentally that the algorithm outperforms traditional index construction methods by large factors.
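The idea of bulk loading by recursive partitioning with a pluggable split strategy can be illustrated with a small in-memory sketch. It is not the authors' implementation: the abstract describes a Quicksort variant that works on secondary storage, whereas this example keeps everything in a Python list. All names (`bulk_load`, `partition_around_rank`, `max_spread_dimension`, `sample_size`) are hypothetical, and the maximum-spread rule is only one assumed example of a user-defined split strategy.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
# Assumed interface: a split strategy looks at a sample and returns a dimension.
SplitStrategy = Callable[[Sequence[Point]], int]


def max_spread_dimension(sample: Sequence[Point]) -> int:
    """Example strategy (an assumption): split on the dimension with the largest extent."""
    dims = range(len(sample[0]))
    return max(dims, key=lambda d: max(p[d] for p in sample) - min(p[d] for p in sample))


def partition_around_rank(pts: List[Point], lo: int, hi: int, k: int, dim: int) -> None:
    """Quickselect-style partial sort of pts[lo:hi] along `dim`.

    Afterwards every point in pts[lo:k] is <= every point in pts[k:hi] in `dim`.
    Only the side containing rank k is processed further, so no full sort is done.
    """
    while hi - lo > 1:
        pivot = pts[random.randrange(lo, hi)][dim]
        i, j = lo, hi - 1
        while i <= j:
            while pts[i][dim] < pivot:
                i += 1
            while pts[j][dim] > pivot:
                j -= 1
            if i <= j:
                pts[i], pts[j] = pts[j], pts[i]
                i += 1
                j -= 1
        if k <= j:
            hi = j + 1      # rank k lies in the left part
        elif k >= i:
            lo = i          # rank k lies in the right part
        else:
            return          # rank k sits between the parts and is already in place


def bulk_load(points: Sequence[Point], capacity: int,
              choose_dim: SplitStrategy = max_spread_dimension,
              sample_size: int = 32) -> List[List[Point]]:
    """Return data pages (lists of at most `capacity` points) built top-down."""
    pts = list(points)
    pages: List[List[Point]] = []

    def build(lo: int, hi: int) -> None:
        n = hi - lo
        if n <= capacity:
            pages.append(pts[lo:hi])
            return
        # The split decision is made on a random sample, not on the full partition.
        sample = random.sample(pts[lo:hi], min(sample_size, n))
        dim = choose_dim(sample)
        # Split at a multiple of the page capacity so that pages fill up completely.
        n_pages = -(-n // capacity)  # ceiling division
        k = lo + (n_pages // 2) * capacity
        partition_around_rank(pts, lo, hi, k, dim)
        build(lo, k)
        build(k, hi)

    if pts:
        build(0, len(pts))
    return pages
```

In this sketch each recursion level does expected linear work and the depth is logarithmic in the number of pages, which matches the O(n log n) bound stated above only in expectation and only for the in-memory case. A call such as `bulk_load(points, capacity=16)` returns the leaf pages; building directory levels on top of them is left out.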


Keywords: Split Strategy; Improvement Factor; Index Construction; Data Page; Secondary Storage





Copyright information

© Springer-Verlag Berlin Heidelberg 1999

Authors and Affiliations

  • Christian Böhm (1)
  • Hans-Peter Kriegel (1)
  1. University of Munich, Munich, Germany
