Abstract
Extract, Transform and Load (ETL) processes organized as workflows play an important role in data warehousing. Because ETL workflows are usually complex, various ETL facilities have been developed to support their control-flow process modeling and execution control. Evaluating the quality of these facilities requires synthetic ETL workflow test cases that cover both control-flow and data-flow aspects, so that facility functionality can be checked at construction time and correctness and performance can be validated at run time. Although approaches for generating synthetic workflows and synthetic data sets exist in the literature, little work considers both aspects together specifically for ETL workflow generation. To address this issue, this paper proposes a schema-aware ETL workflow generator with which users characterize their ETL workflows through various parameters and obtain ETL workflow test cases comprising the control flow of ETL activities, compliant schemas and associated recordsets. The generator works in three steps. First, given the types and ratios of individual activities and parameters characterizing their connections, it produces ETL activities and forms an ETL skeleton, which determines how the generated activities cooperate with each other. Second, given parameters characterizing schema transformation, e.g. ranges for the number of attributes, it resolves attribute dependencies and refines the input/output schemas with compliant attributes and their data types. Third, recordsets are generated according to cardinality specifications. In the experiments, ETL workflows in specific patterns are produced to demonstrate the capability of the generator, and thousands of ETL workflow test cases are generated in seconds to verify its usability.
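The three steps can be illustrated with a minimal Python sketch. It is not the generator described in the paper: the activity kinds (`filter`, `project`, `add_field`), the function names, the default ratios and the way schemas are propagated are all assumptions made up for illustration, and attribute dependencies are reduced to "each activity inherits the output schema of its single upstream activity".

```python
import random
from dataclasses import dataclass, field


@dataclass
class Activity:
    name: str
    kind: str                                        # hypothetical kinds, see below
    inputs: list = field(default_factory=list)       # names of upstream activities
    out_schema: dict = field(default_factory=dict)   # attribute name -> data type


def build_skeleton(n, ratios, rng):
    """Step 1: draw activity kinds by ratio and connect them into a skeleton."""
    kinds = rng.choices(list(ratios), weights=list(ratios.values()), k=n)
    acts = [Activity("source", "source")]
    for i, kind in enumerate(kinds, start=1):
        upstream = rng.choice(acts)                  # only earlier nodes, so no cycles
        acts.append(Activity(f"a{i}", kind, [upstream.name]))
    return acts


def refine_schemas(acts, attr_range, rng):
    """Step 2: propagate schemas so each output is consistent with its input."""
    types = ["int", "str", "float"]
    by_name = {a.name: a for a in acts}
    for a in acts:                                   # creation order is topological
        if a.kind == "source":
            k = rng.randint(*attr_range)             # number of attributes in range
            a.out_schema = {f"c{j}": rng.choice(types) for j in range(1, k + 1)}
            continue
        schema = dict(by_name[a.inputs[0]].out_schema)
        if a.kind == "project" and len(schema) > 1:
            schema.pop(rng.choice(sorted(schema)))   # drop one attribute
        elif a.kind == "add_field":
            schema[f"{a.name}_new"] = rng.choice(types)
        a.out_schema = schema                        # "filter" keeps the schema as-is
    return acts


def generate_records(schema, cardinality, rng):
    """Step 3: produce a recordset with the requested number of records."""
    make = {
        "int": lambda: rng.randint(0, 99),
        "float": lambda: round(rng.random(), 3),
        "str": lambda: rng.choice(["x", "y", "z"]),
    }
    return [{col: make[t]() for col, t in schema.items()} for _ in range(cardinality)]


def generate_workflow(n=5, cardinality=10, seed=0):
    """Run the three steps; all default parameter values are illustrative."""
    rng = random.Random(seed)
    ratios = {"filter": 0.5, "project": 0.3, "add_field": 0.2}
    acts = refine_schemas(build_skeleton(n, ratios, rng), (3, 6), rng)
    records = generate_records(acts[0].out_schema, cardinality, rng)
    return acts, records
```

Calling `generate_workflow(n=5, cardinality=10)` returns the source plus five activities, each with an output schema derived from its upstream activity, together with ten records for the source schema. A seeded `random.Random` is used so that the same parameters reproduce the same test case.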
Notes
Subscripts always start from 1 in this paper.
Acknowledgments
This work is supported by the National Basic Research Program of China (No. 2009CB320706).
Cite this article
Du, N., Ye, X. & Wang, J. A schema aware ETL workflow generator. Inf Syst Front 16, 453–471 (2014). https://doi.org/10.1007/s10796-012-9352-2
DOI: https://doi.org/10.1007/s10796-012-9352-2