Bulk Loading a Linear Hash File

Rafiei, Davood; Hu, Cheng

doi:10.1007/11823728_3

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4081)

Included in the following conference series:

International Conference on Data Warehousing and Knowledge Discovery

Abstract

We study the problem of bulk loading a linear hash file. A good hash function distributes records to random locations in the file, but performing a random disk access for each record is costly, and this cost grows with the size of the file. We propose a bulk loading algorithm that avoids random disk accesses by merging multiple accesses to the same location into a single access and by reordering the accesses so that pages are accessed sequentially. Our analysis shows that the algorithm is near-optimal, with a cost roughly equal to the cost of sorting the dataset, so it scales to very large datasets. Our experiments show that our method improves on the Berkeley DB load utility by two orders of magnitude in running time, and that the improvement scales well with the size of the dataset.
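The idea in the abstract, collapsing all accesses to a page into one and visiting pages in order, can be illustrated with a short Python sketch. This is not the authors' algorithm, whose details are not given here. It assumes the final number of buckets is known in advance, that the file starts from a single bucket, and that an in-memory sort stands in for the external sort a large dataset would need. The names `bucket_address` and `bulk_load`, and the use of `zlib.crc32` as the hash function, are hypothetical choices made for illustration.

```python
import zlib
from itertools import groupby


def bucket_address(key_hash, num_buckets):
    """Linear-hashing address in a file of num_buckets buckets.

    Assumes the file grew from one initial bucket, so the level and
    split pointer can be derived from num_buckets alone.
    """
    level = num_buckets.bit_length() - 1   # largest i with 2**i <= num_buckets
    split = num_buckets - (1 << level)     # buckets below this were already split
    addr = key_hash % (1 << level)
    if addr < split:
        # This bucket was split in the current round: use the finer hash.
        addr = key_hash % (1 << (level + 1))
    return addr


def bulk_load(records, num_buckets):
    """Yield (bucket address, records for that bucket) in ascending address order.

    Each bucket is produced exactly once, so a writer consuming this
    generator touches every page a single time, in sequential order.
    """
    tagged = [
        (bucket_address(zlib.crc32(key.encode()), num_buckets), key, value)
        for key, value in records
    ]
    # Stand-in for an external sort on the bucket address (assumption).
    tagged.sort(key=lambda t: t[0])
    for addr, group in groupby(tagged, key=lambda t: t[0]):
        yield addr, [(key, value) for _, key, value in group]


if __name__ == "__main__":
    data = [(f"key{i}", i) for i in range(20)]
    for addr, page in bulk_load(data, 5):
        print(addr, [key for key, _ in page])
```

Because the records are sorted by destination before any write, the loading cost in this sketch is dominated by the sort, which matches the abstract's claim that the cost is roughly that of sorting the dataset.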

Author information

Authors and Affiliations

University of Alberta,
Davood Rafiei & Cheng Hu

Editor information

Editors and Affiliations

Institute of Software Technology and Interactive Systems, Vienna University of Technology, Favoritenstr. 9-11/188, A-1040, Wien, Austria
A Min Tjoa
Department of Software and Computing Systems, University of Alicante, Spain
Juan Trujillo

About this paper

Cite this paper

Rafiei, D., Hu, C. (2006). Bulk Loading a Linear Hash File. In: Tjoa, A.M., Trujillo, J. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2006. Lecture Notes in Computer Science, vol 4081. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11823728_3

DOI: https://doi.org/10.1007/11823728_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37736-8
Online ISBN: 978-3-540-37737-5
eBook Packages: Computer Science, Computer Science (R0)
