Bulk Loading a Linear Hash File

  • Davood Rafiei
  • Cheng Hu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4081)


We study the problem of bulk loading a linear hash file. A good hash function distributes records to random locations in the file, but performing a random disk access for each record is costly, and this cost grows with the size of the file. We propose a bulk loading algorithm that avoids random disk accesses by collapsing multiple accesses to the same location into a single access and by reordering the accesses so that pages are accessed sequentially. Our analysis shows that the algorithm is near-optimal, with a cost roughly equal to the cost of sorting the dataset, so it can scale to very large datasets. Our experiments show that our method improves upon the Berkeley DB load utility by two orders of magnitude in running time, and that the improvement scales well with the size of the dataset.
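
The core idea in the abstract (one access per location, locations visited in order) can be illustrated with a minimal Python sketch. This is not the authors' algorithm, only a simplified reading of it under stated assumptions: the final number of buckets is known in advance, the textbook linear-hashing address rule applies, and an in-memory sort stands in for the external sort a large dataset would need. The names `bucket_address`, `final_state`, `bulk_load`, and the parameter `n0` (initial bucket count) are hypothetical.

```python
from itertools import groupby


def bucket_address(h, level, split_ptr, n0=1):
    """Textbook linear-hashing address of hash value h (assumed rule)."""
    addr = h % (n0 << level)
    if addr < split_ptr:
        # Buckets below the split pointer have already been split,
        # so they are addressed with the next-level hash function.
        addr = h % (n0 << (level + 1))
    return addr


def final_state(num_buckets, n0=1):
    """Level and split pointer of a file that holds num_buckets buckets."""
    level = 0
    while (n0 << (level + 1)) <= num_buckets:
        level += 1
    return level, num_buckets - (n0 << level)


def bulk_load(records, num_buckets, hash_fn=hash, n0=1):
    """Yield (bucket address, records) pairs in increasing address order.

    Each bucket appears once, so a writer consuming this generator
    touches every page a single time and in sequential order.
    """
    level, split_ptr = final_state(num_buckets, n0)
    keyed = sorted(
        ((bucket_address(hash_fn(k), level, split_ptr, n0), k, v)
         for k, v in records),
        key=lambda t: t[0],  # stand-in for an external sort
    )
    for addr, group in groupby(keyed, key=lambda t: t[0]):
        yield addr, [(k, v) for _, k, v in group]
```

In this sketch the sort dominates the work, which mirrors the abstract's claim that the load cost is roughly the cost of sorting the dataset.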


Keywords: Hash Function, Hash Table, Record Movement, Naive Algorithm, Cache Performance





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Davood Rafiei, University of Alberta
  • Cheng Hu, University of Alberta
