Composite Key Generation on a Shared-Nothing Architecture

  • Marie Hoffmann (email author)
  • Alexander Alexandrov
  • Periklis Andritsos
  • Juan Soto
  • Volker Markl
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8904)

Abstract

Generating synthetic data sets is integral to benchmarking, debugging, and simulating future scenarios. As data sets grow larger, reproducing the characteristics of real data becomes necessary for the success of new algorithms. Recently introduced software systems allow for synthetic data generation that is truly parallel. These systems use fast pseudorandom number generators and can handle complex schemas and uniqueness constraints on single attributes. Uniqueness is essential for forming keys, which identify single entries in a database instance. The uniqueness property is usually guaranteed by sampling from a uniform distribution and adjusting the sample size to the output size of the table such that there are no collisions. However, for real composite keys, where only the combination of the key attributes has the uniqueness property, a different strategy needs to be employed. In this paper, we present a novel approach to generating composite keys within a parallel data generation framework. We compute a joint probability distribution that incorporates the distributions of the key attributes and use the unique sequence positions of entries to address distinct values in the key domain.
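
The idea of addressing distinct key values by sequence position can be illustrated with a small sketch. This is not the authors' implementation; it is a minimal example under stated assumptions: two integer key attributes, a joint histogram given as row counts per pair of half-open value ranges, and the hypothetical helper names `build_bins` and `composite_key`. Each row position is mapped to a histogram bin via cumulative counts, and its offset inside the bin is decomposed into a distinct attribute pair, so no coordination between workers is needed.

```python
from bisect import bisect_right
from itertools import accumulate


def build_bins(ranges_a, ranges_b, counts):
    """Flatten a joint histogram (assumed input format) into bins.

    ranges_a, ranges_b: half-open integer ranges [lo, hi) per attribute.
    counts[i][j]: number of rows whose key falls in ranges_a[i] x ranges_b[j].
    Returns the bins in row-major order and their cumulative row counts.
    """
    bins = []
    for i, (a_lo, a_hi) in enumerate(ranges_a):
        for j, (b_lo, b_hi) in enumerate(ranges_b):
            capacity = (a_hi - a_lo) * (b_hi - b_lo)
            if counts[i][j] > capacity:
                raise ValueError("bin cannot hold that many distinct keys")
            bins.append((a_lo, b_lo, b_hi - b_lo))
    flat_counts = [c for row in counts for c in row]
    return bins, list(accumulate(flat_counts))


def composite_key(position, bins, cumulative):
    """Map a unique row position to a unique (a, b) pair."""
    # First bin whose cumulative count exceeds the position.
    k = bisect_right(cumulative, position)
    a_lo, b_lo, b_width = bins[k]
    offset = position - (cumulative[k - 1] if k else 0)
    # Mixed-radix decomposition of the offset inside the bin.
    return (a_lo + offset // b_width, b_lo + offset % b_width)


# Illustrative histogram: 10 rows spread over 2 x 2 bins.
ranges_a = [(0, 2), (2, 10)]
ranges_b = [(0, 3), (3, 4)]
counts = [[4, 1], [3, 2]]
bins, cumulative = build_bins(ranges_a, ranges_b, counts)
keys = [composite_key(p, bins, cumulative) for p in range(cumulative[-1])]
```

Because a key depends only on its position and the shared histogram, each node of a shared-nothing cluster could generate its own slice of positions independently. The sketch fills each bin in a fixed order; a real generator would likely permute offsets pseudorandomly inside a bin.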

Keywords

Hash Function, Pseudorandom Number Generator, Attribute Domain, Output Domain, Joint Histogram
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgements

We thank the anonymous reviewers for their input that helped to improve the quality of the paper. Furthermore, the first author would like to thank Christian Lessig for his valuable assistance in editing.

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Marie Hoffmann (1), email author
  • Alexander Alexandrov (1)
  • Periklis Andritsos (2)
  • Juan Soto (1)
  • Volker Markl (1)
  1. DIMA, Technische Universität Berlin, Berlin, Germany
  2. Institut des Systèmes d’Information, Université de Lausanne, Lausanne, Switzerland