ResilientStore: A Heuristic-Based Data Format Selector for Intermediate Results

  • Rana Faisal Munir
  • Oscar Romero
  • Alberto Abelló
  • Besim Bilalli
  • Maik Thiele
  • Wolfgang Lehner
Conference paper

DOI: 10.1007/978-3-319-45547-1_4

Part of the Lecture Notes in Computer Science book series (LNCS, volume 9893)
Cite this paper as:
Munir R.F., Romero O., Abelló A., Bilalli B., Thiele M., Lehner W. (2016) ResilientStore: A Heuristic-Based Data Format Selector for Intermediate Results. In: Bellatreche L., Pastor Ó., Almendros Jiménez J., Aït-Ameur Y. (eds) Model and Data Engineering. MEDI 2016. Lecture Notes in Computer Science, vol 9893. Springer, Cham

Abstract

Large-scale data analysis is an important activity in many organizations and typically requires the deployment of data-intensive workflows. As data is processed, these workflows generate large intermediate results, which are typically pipelined from one operator to the next. If materialized, however, these results become reusable, so subsequent workflows need not recompute them. Many existing solutions materialize intermediate results, but all of them assume a fixed data format. A fixed format may not be optimal for every situation. For example, it is well known that different data fragmentation strategies (e.g., horizontal and vertical) perform better or worse depending on the access patterns of the subsequent operations. In this paper, we present ResilientStore, which assists in selecting the most appropriate data format for materializing intermediate results. Given a workflow and a set of materialization points, it uses rule-based heuristics to choose the best storage data format based on subsequent access patterns. We have implemented ResilientStore for HDFS and three different data formats: SequenceFile, Parquet, and Avro. Experimental results show that our solution gives 18% better performance than any solution based on a single fixed format.
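The idea of choosing a format from subsequent access patterns can be illustrated with a minimal Python sketch. It is not the rule set used by ResilientStore, which the abstract does not spell out. The `AccessPattern` model, the `choose_format` function, and the 0.5 projection threshold are all hypothetical. The sketch relies only on the general fact that Parquet is a columnar (vertical) layout while Avro is row-oriented (horizontal); SequenceFile, also row-oriented, is left out of the decision for simplicity.

```python
from dataclasses import dataclass
from enum import Enum


class Format(Enum):
    AVRO = "Avro"        # row-oriented (horizontal) layout
    PARQUET = "Parquet"  # columnar (vertical) layout


@dataclass(frozen=True)
class AccessPattern:
    """Hypothetical model of how one downstream operator reads a result."""
    columns_read: int   # columns the operator actually touches
    total_columns: int  # columns in the materialized intermediate result


def choose_format(consumers, projection_threshold=0.5):
    """Pick a format from the average fraction of columns read.

    Illustrative rule (an assumption, not the paper's heuristic):
    narrow projections favor a columnar layout, wide reads favor rows.
    """
    if not consumers:
        raise ValueError("at least one consuming operator is required")
    fractions = [c.columns_read / c.total_columns for c in consumers]
    average = sum(fractions) / len(fractions)
    return Format.PARQUET if average <= projection_threshold else Format.AVRO


# One consumer projects 2 of 20 columns; the others scan nearly everything.
narrow = choose_format([AccessPattern(2, 20)])
wide = choose_format([AccessPattern(20, 20), AccessPattern(18, 20)])
print(narrow.value, wide.value)
```

A real selector would presumably weigh more than projection width, for example selectivity or the number of times a result is reused, but those rules are not stated here and are deliberately not guessed at.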

Keywords

Big data · Data-intensive workflows · Intermediate results · Data format · HDFS

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Rana Faisal Munir (1)
  • Oscar Romero (1)
  • Alberto Abelló (1)
  • Besim Bilalli (1)
  • Maik Thiele (2)
  • Wolfgang Lehner (2)

  1. Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
  2. Technische Universität Dresden (TUD), Dresden, Germany
