A schema aware ETL workflow generator

Information Systems Frontiers

Abstract

Extract, Transform and Load (ETL) processes organized as workflows play an important role in data warehousing. As ETL workflows are usually complex, various ETL facilities have been developed to address their control-flow process modeling and execution control. To evaluate the quality of ETL facilities, synthetic ETL workflow test cases, consisting of control-flow and data-flow aspects, are needed to check ETL facility functionality at construction time and to validate the correctness and performance of ETL facilities at run time. Although some approaches to generating synthetic workflows and data set test cases exist in the literature, little work considers both aspects at the same time, in particular for ETL workflow generation. To address this issue, this paper proposes a schema aware ETL workflow generator with which users can characterize their ETL workflows through various parameters and obtain ETL workflow test cases comprising the control flow of ETL activities, compliant schemas and associated recordsets. Our generator works in three steps. First, given the types and ratios of individual activities and the parameters characterizing their connections, the generator produces ETL activities and forms an ETL skeleton, which determines how the generated activities cooperate with each other. Second, given the parameters characterizing schema transformation, e.g. ranges for the number of attributes, the generator resolves attribute dependencies and refines input/output schemas with compliant attributes and their data types. In the last step, recordsets are generated following cardinality specifications. In the experiments, ETL workflows in specific patterns are produced to demonstrate the capability of the generator. Further experiments that generate thousands of ETL workflow test cases in seconds verify its usability.
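The three steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the activity type names, the parameter names (`type_ratio`, `attr_range`, `cardinality`), the two data types, and the rule that a projection keeps a random subset of its input attributes are all assumptions made for illustration only.

```python
import random

DATA_TYPES = ["int", "str"]  # assumed data types, for illustration only


def generate_skeleton(n_activities, type_ratio, rng):
    """Step 1: draw activity types by ratio and wire them into an acyclic skeleton."""
    types = rng.choices(list(type_ratio), weights=list(type_ratio.values()), k=n_activities)
    activities = [{"id": i, "type": t, "inputs": []} for i, t in enumerate(types)]
    activities[0]["type"] = "source"  # the first activity only provides data
    for act in activities[1:]:
        # Edges only point from lower to higher ids, so no cycle can appear.
        fan_in = 2 if act["type"] == "join" else 1
        candidates = list(range(act["id"]))
        act["inputs"] = sorted(rng.sample(candidates, min(fan_in, len(candidates))))
    return activities


def assign_schemas(activities, attr_range, rng):
    """Step 2: derive every output schema from the schemas it depends on."""
    for act in activities:
        if not act["inputs"]:
            # Source schema: attribute count drawn from the given range, subscripts from 1.
            n = rng.randint(*attr_range)
            act["schema"] = {f"a{act['id']}_{k}": rng.choice(DATA_TYPES) for k in range(1, n + 1)}
            continue
        merged = {}
        for src in act["inputs"]:
            merged.update(activities[src]["schema"])
        if act["type"] == "projection":
            # Assumed rule: a projection keeps a non-empty random subset of its input.
            keep = rng.sample(sorted(merged), rng.randint(1, len(merged)))
            merged = {name: merged[name] for name in sorted(keep)}
        act["schema"] = merged
    return activities


def generate_recordset(schema, cardinality, rng):
    """Step 3: produce `cardinality` records whose fields match the schema."""
    makers = {"int": lambda: rng.randint(0, 999), "str": lambda: rng.choice("abcdef") * 3}
    return [{name: makers[dtype]() for name, dtype in schema.items()} for _ in range(cardinality)]


def generate_test_case(n_activities=5, attr_range=(2, 4), cardinality=3, seed=0):
    """Run the three steps in order and attach a recordset to the source activity."""
    rng = random.Random(seed)
    type_ratio = {"filter": 0.5, "projection": 0.3, "join": 0.2}  # assumed ratios
    activities = generate_skeleton(n_activities, type_ratio, rng)
    assign_schemas(activities, attr_range, rng)
    activities[0]["records"] = generate_recordset(activities[0]["schema"], cardinality, rng)
    return activities


if __name__ == "__main__":
    for activity in generate_test_case():
        print(activity["id"], activity["type"], activity["inputs"], sorted(activity["schema"]))
```

Seeding the random generator makes each generated test case reproducible, which is one plausible way to regenerate the same workflow when a facility under test fails on it.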

Notes

  1. Subscripts always start from 1 in this paper.

Acknowledgments

This work is supported by the National Basic Research Program of China (No. 2009CB320706).

Author information

Corresponding author

Correspondence to Naiqiao Du.

About this article

Cite this article

Du, N., Ye, X. & Wang, J. A schema aware ETL workflow generator. Inf Syst Front 16, 453–471 (2014). https://doi.org/10.1007/s10796-012-9352-2

  • DOI: https://doi.org/10.1007/s10796-012-9352-2
