Datenbank-Spektrum, Volume 17, Issue 3, pp 245–253

Preserving Recomputability of Results from Big Data Transformation Workflows Depending on External Systems and Human Interactions
  • Matthias Kricke
  • Martin Grimmer
  • Michael Schmeißer
Special Topic Contribution

Abstract

The ability to recompute results from raw data at any time is important for data-driven companies: it ensures data stability and allows new data to be incorporated selectively into an already delivered data product. However, data transformation processes are heterogeneous, and manual work by domain experts may be part of creating a deliverable data product. Because domain experts and their work are expensive and time-consuming, a recomputation process must be able to replay former human interactions automatically. The task becomes even more challenging when external systems are involved or data changes over time. In this paper, we propose a system architecture that ensures the recomputability of results from big data transformation workflows across internal and external systems by using distributed key-value data stores. The architecture also makes it possible to incorporate human interactions from earlier data transformation processes. We describe how our approach significantly relieves external systems while at the same time increasing the performance of big data transformation workflows.
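
To illustrate the core idea behind this approach, recording external lookups and human corrections in a versioned key-value store so that a later recomputation can replay them instead of repeating them, the following Python sketch pins reads to a transaction time. It is a minimal, assumption-based illustration only, not the paper's implementation (which targets distributed key-value data stores); the class, key, and parameter names are hypothetical.

    # Minimal sketch (assumption, not the authors' system): an append-only
    # key-value store whose reads can be resolved "as of" a transaction time,
    # so a workflow replay sees exactly the data of the original run.
    from bisect import bisect_right
    from collections import defaultdict


    class BitemporalKVStore:
        """Append-only key-value store with as-of reads by transaction time."""

        def __init__(self):
            # key -> sorted list of (transaction_time, value); never overwritten
            self._versions = defaultdict(list)

        def put(self, key, value, tx_time):
            """Record a new version of `key` observed at transaction time `tx_time`."""
            self._versions[key].append((tx_time, value))
            self._versions[key].sort(key=lambda kv: kv[0])

        def get(self, key, as_of):
            """Return the latest version of `key` with transaction time <= `as_of`."""
            versions = self._versions.get(key, [])
            idx = bisect_right([t for t, _ in versions], as_of)
            return versions[idx - 1][1] if idx else None


    if __name__ == "__main__":
        store = BitemporalKVStore()
        # An external lookup and a later manual correction by a domain expert,
        # both captured with their transaction times during the original run.
        store.put("customer/42/address", "Main St 1", tx_time=100)
        store.put("customer/42/address", "Elm St 7", tx_time=250)

        # Recomputing the data product delivered at time 180 replays the old
        # value without contacting the external system or the expert again.
        assert store.get("customer/42/address", as_of=180) == "Main St 1"
        assert store.get("customer/42/address", as_of=300) == "Elm St 7"

Pinning reads to the transaction time of the original run is what allows former human interactions and external responses to be re-applied automatically while the external systems themselves remain untouched.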

Keywords

Big Data · Recomputability · System architecture · Bitemporality · Time-to-consistency

Copyright information

© Springer-Verlag GmbH Deutschland 2017

Authors and Affiliations

  1. Leipzig University, Leipzig, Germany
  2. mgm technology partners, Leipzig, Germany
