A Scalable Framework for Universal Data Generation in Parallel

  • Ling Gu
  • Minqi Zhou (email author)
  • Qiangqiang Kang
  • Aoying Zhou
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8904)


Nowadays, more and more companies, such as Amazon and Twitter, face the big data problem, which demands higher performance in managing extremely large data sets. Data management systems with new architectures that take full advantage of computer hardware are emerging, with the aim of maximizing system performance and fulfilling customers' current and even future requirements. Testing the performance and confirming the suitability of a new data management system therefore becomes a primary task for these companies. Hence, effectively generating a scaled data set of the desired volume and at the desired velocity, while preserving as many characteristics of the real data set as possible (i.e., being realistic), becomes a pressing problem. In this paper, we propose PSUG, which generates a realistic database of the required volume and at the required velocity in a scalable, parallel manner. Our extensive experimental studies confirm the efficiency and effectiveness of the proposed method.



Our research is supported by the Innovation Program of Shanghai Municipal Education Commission (No. 14ZZ045) and the National Science Foundation of China under grants No. 61332006 and No. 61432006.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Ling Gu (1)
  • Minqi Zhou (1, email author)
  • Qiangqiang Kang (1)
  • Aoying Zhou (1)
  1. Institute of Data Science and Engineering, East China Normal University, Shanghai, China
