Volume 17, Issue 3, pp 245–253

Preserving Recomputability of Results from Big Data Transformation Workflows Depending on External Systems and Human Interactions
  • Matthias Kricke
  • Martin Grimmer
  • Michael Schmeißer


The ability to recompute results from raw data at any time is important for data-driven companies: it ensures data stability and allows new data to be selectively incorporated into an already delivered data product. However, data transformation processes are heterogeneous, and the manual work of domain experts may be part of the process that creates a deliverable data product. Because domain experts and their work are expensive and time-consuming, a recomputation process must be able to replay former human interactions automatically. The task becomes even more challenging when external systems are used or data changes over time. In this paper, we propose a system architecture that ensures the recomputability of results from big data transformation workflows on internal and external systems by using distributed key-value data stores. Furthermore, the architecture incorporates the human interactions of former data transformation processes. We describe how our approach significantly relieves external systems and at the same time increases the performance of big data transformation workflows.
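The core idea behind recomputability — versioning every record by its transaction time so that a workflow can be replayed exactly as it originally ran — can be sketched in a few lines. The following is a minimal in-memory illustration, not the paper's implementation: the proposed architecture uses distributed key-value stores, and the class and key names here are hypothetical.

```python
class BitemporalStore:
    """Append-only key-value store versioned by transaction time.

    Every put records a transaction timestamp; a get "as of" a past
    timestamp returns exactly what a workflow would have read then,
    which is what makes an earlier result recomputable.
    """

    def __init__(self):
        # key -> list of (tx_time, value) pairs, in insertion order
        self._versions = {}

    def put(self, key, value, tx_time):
        self._versions.setdefault(key, []).append((tx_time, value))

    def get_as_of(self, key, tx_time):
        """Return the newest value written at or before tx_time."""
        candidates = [(t, v) for t, v in self._versions.get(key, [])
                      if t <= tx_time]
        return max(candidates)[1] if candidates else None


store = BitemporalStore()
store.put("sensor/42", "raw", tx_time=1)
# A later correction by a domain expert does not overwrite history:
store.put("sensor/42", "cleaned-by-expert", tx_time=5)

# Recomputing a result originally produced at tx_time 3 still reads
# the old value, while current computations see the correction.
print(store.get_as_of("sensor/42", 3))  # raw
print(store.get_as_of("sensor/42", 9))  # cleaned-by-expert
```

The same pattern extends to recorded human interactions: storing an expert's edit as a timestamped version lets a recomputation replay it automatically instead of consulting the expert again.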


Keywords: Big Data · Recomputability · System architecture · Bitemporality · Time-to-consistency



This work was partly funded by the German Federal Ministry of Education and Research within the project Competence Center for Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig (BMBF 01IS14014B) and Explicit Privacy-Preserving Host Intrusion Detection System EXPLOIDS (BMBF 16KIS0522K).



Copyright information

© Springer-Verlag GmbH Deutschland 2017

Authors and Affiliations

  1. Leipzig University, Leipzig, Germany
  2. mgm technology partners, Leipzig, Germany
