Modern big data frameworks (such as Hadoop and Spark) allow multiple users to perform large-scale analysis simultaneously by deploying data-intensive workflows (DIWs). The DIWs of different users share many common tasks (i.e., 50–80%), whose output can be materialized and reused in future executions. Materializing the output of such common tasks improves the overall processing time of DIWs and also saves computational resources. Current solutions for materialization store data on distributed file systems using a fixed storage format. However, a fixed choice is not optimal in every situation. Specifically, different layouts (i.e., horizontal, vertical or hybrid) have a large impact on execution time, depending on the access patterns of the subsequent operations. In this paper, we present a cost-based approach that helps decide the most appropriate storage format in every situation. We first present a generic cost-based framework that selects the best format by considering the three main layouts. Then, we use this framework to instantiate cost models for specific Hadoop storage formats (namely SequenceFile, Avro and Parquet), and test it with two standard benchmark suites. On average, our solution gives a 1.33\(\times\) speedup over fixed SequenceFile, a 1.11\(\times\) speedup over fixed Avro, and a 1.32\(\times\) speedup over fixed Parquet; overall, it provides a 1.25\(\times\) speedup.
This research has been funded by the European Commission through the Erasmus Mundus Joint Doctorate “Information Technologies for Business Intelligence—Doctoral College” (IT4BI-DC)
This appendix shows how the file sizes of the three considered HDFS file formats are estimated, together with the system variables and their values in our testbed. Table 3 lists all the system variables, divided into three categories. The first category contains the disk-related variables, which are needed to calculate the reading and writing costs. The second category contains the network variables, which are used to calculate the transfer cost: Hadoop writes multiple copies of the data for fault tolerance, which involves transferring data to other nodes, so the transfer cost is part of the overall write cost. The final category lists the variables related to the configuration of our Hadoop cluster.
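To illustrate why both the disk and the network variables enter the write cost, the following is a minimal sketch rather than the cost model of this paper. It assumes a simple additive, non-pipelined estimate in which every replica is written to disk once and every replica beyond the first is also shipped over the network. The function name, parameter names and bandwidth values are hypothetical and are not taken from Table 3.

```python
def write_cost_seconds(size_bytes, replicas, disk_write_bw, network_bw):
    """Rough write-time estimate for a replicated file (illustrative only).

    size_bytes     -- estimated file size in bytes
    replicas       -- number of copies kept for fault tolerance
    disk_write_bw  -- disk write bandwidth in bytes/second (assumed value)
    network_bw     -- network bandwidth in bytes/second (assumed value)
    """
    if replicas < 1:
        raise ValueError("at least one copy must be written")
    # Every copy is written to some disk.
    disk_time = replicas * size_bytes / disk_write_bw
    # Every copy except the local one is transferred to another node.
    transfer_time = (replicas - 1) * size_bytes / network_bw
    return disk_time + transfer_time


# Example with made-up bandwidths: 1 GB file, 3 copies.
cost = write_cost_seconds(1_000_000_000, 3, 100_000_000, 125_000_000)
print(cost)
```

Under this simplification the transfer term vanishes when only one copy is kept, which matches the observation that the network variables matter because of replication.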
A.1 SequenceFile (SeqFile) format
SeqFile was introduced in 2009 to improve the performance of the MapReduce framework. It stores the temporary output of map phases in compressed form to reduce I/O. It is also splittable, which makes it well suited to parallel processing. It uses a special type of horizontal layout, which stores data in the form of key-value pairs. Figure 18 shows its structure and Table 4 lists the SeqFile-specific variables with their values.
To instantiate our generic cost model, we need to estimate the sizes of the header, body and footer sections. The header section of SeqFile has a fixed size, so we define it as a constant. To estimate the body size, we need to calculate the row and metadata sizes. SeqFile divides each row into a key-value pair: it stores one column in the key and the remaining columns in the value, separated by a user-defined separator. Thus, it has two types of metadata: one to separate values and another to delimit blocks for parallel processing. The size of a row is then composed of some fields of fixed size (i.e., record and key lengths) together with the corresponding key-value pair, as shown in Fig. 18, containing all user columns (notice that we need two fewer user-defined separators than columns, because the key is managed by the file format itself). Equation (27) estimates this size (i.e., a row for SeqFile), which is later used in Eq. (28) to estimate the size of all key-value pairs. Equation (29) calculates the overhead of block-related metadata (i.e., sync markers), which SeqFile introduces at fixed intervals. Finally, Eq. (30) simply adds the sizes of the key-value pairs and the metadata, which in turn allows us to obtain the total size of SeqFile using Eq. (1) with an empty footer section.
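The estimation steps described for SeqFile can be sketched as follows. This is an illustrative reading of the prose, not a transcription of Eqs. (27)–(30): all names are hypothetical, the default constants are placeholders rather than the values of Table 4, and the number of sync markers is assumed to be the size of the key-value pairs divided by a fixed interval.

```python
from dataclasses import dataclass


@dataclass
class SeqFileParams:
    # All defaults are placeholder values chosen for illustration.
    header_size: int = 100        # fixed header, in bytes
    record_length_size: int = 4   # fixed field holding the record length
    key_length_size: int = 4      # fixed field holding the key length
    separator_size: int = 1       # user-defined separator between value columns
    sync_marker_size: int = 16    # block-related metadata
    sync_interval: int = 2000     # bytes of data between two sync markers


def seqfile_row_size(col_sizes, p):
    """Size of one key-value pair: fixed fields, all columns, separators."""
    # One column goes to the key, so the value holds n - 1 columns
    # and needs n - 2 separators.
    separators = max(len(col_sizes) - 2, 0)
    return (p.record_length_size + p.key_length_size
            + sum(col_sizes) + separators * p.separator_size)


def seqfile_size(num_rows, col_sizes, p):
    """Total size: header + (key-value pairs + sync markers), empty footer."""
    pairs_size = num_rows * seqfile_row_size(col_sizes, p)
    sync_markers = pairs_size // p.sync_interval
    metadata_size = sync_markers * p.sync_marker_size
    body_size = pairs_size + metadata_size
    return p.header_size + body_size


params = SeqFileParams()
print(seqfile_row_size([4, 8, 8], params))
print(seqfile_size(1000, [4, 8, 8], params))
```

Here `col_sizes` holds the average size in bytes of each column of the intermediate result, with one entry per column.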
A.2 Avro format
Apache Avro is a language-neutral data serialization system: data written with Avro in one language can be read in another language without changing the code. This is made possible by the schema information that Avro stores as metadata. Avro is also compressible and splittable. It uses a horizontal layout, and Fig. 19 sketches its physical structure. The Avro-specific variables are given in Table 5. The data schema is stored in a header section of variable length. Similarly, the size of the body is also variable and depends on the number of rows in an IR.
The header section of Avro contains meta information corresponding to the schema of the data, in the form of a JSON document. Given that the size of the schema is orders of magnitude smaller than that of the data, we estimate it as a constant per column. Considering also the version and codec information, the overall header size is calculated by Eq. (31). Following the horizontal layout, Avro adds metadata to each row, which is considered in Eq. (32) to estimate the size of a row. Moreover, it also adds extra metadata to the body for every block. Thus, Eq. (33) calculates the total size of the metadata by multiplying the number of blocks by the combined size of the sync marker and the counter for the number of rows in the block. Finally, Eq. (34) calculates the body size, which in turn allows us to obtain the total size of Avro using Eq. (1) with an empty footer section.
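The Avro estimation can be sketched in the same way. Again, this is a hedged interpretation of the description and not the exact form of Eqs. (31)–(34): the names and default constants are hypothetical and do not come from Table 5, the per-row metadata is modeled as a single constant, and the number of blocks is assumed to be the row data divided by a fixed block size.

```python
import math
from dataclasses import dataclass


@dataclass
class AvroParams:
    # All defaults are placeholder values chosen for illustration.
    version_size: int = 4         # version information in the header
    codec_size: int = 16          # codec information in the header
    schema_per_column: int = 50   # constant schema size per column
    row_metadata_size: int = 1    # metadata added to each row
    sync_marker_size: int = 16    # per-block sync marker
    row_counter_size: int = 8     # per-block counter of rows
    block_size: int = 16000       # bytes of row data per block


def avro_header_size(num_columns, p):
    """Header: version and codec information plus a constant per column."""
    return p.version_size + p.codec_size + num_columns * p.schema_per_column


def avro_size(num_rows, col_sizes, p):
    """Total size: header + (rows + per-block metadata), empty footer."""
    row_size = sum(col_sizes) + p.row_metadata_size
    rows_size = num_rows * row_size
    num_blocks = math.ceil(rows_size / p.block_size)
    metadata_size = num_blocks * (p.sync_marker_size + p.row_counter_size)
    body_size = rows_size + metadata_size
    return avro_header_size(len(col_sizes), p) + body_size


params = AvroParams()
print(avro_header_size(3, params))
print(avro_size(1000, [4, 8, 8], params))
```

Unlike the SeqFile sketch, the header here grows with the number of columns, reflecting that the schema is stored in the file itself.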
A.3 Parquet format
Apache Parquet was introduced in 2013 to provide hybrid layout support for the Hadoop ecosystem. It divides data horizontally into row groups, and each row group is further divided vertically to store columns separately, as sketched in Fig. 20. Additionally, it divides each vertical partition into multiple pages. It also stores the schema and statistical information about the data as meta information in the footer section. All variables specific to Parquet are listed in Table 6.
The header section of Parquet has a fixed size, as stated in Table 6. To estimate the body size, we first need to estimate the total number of row groups (i.e., Eq. 9) and the number of rows per row group (i.e., Eq. 18). Moreover, Parquet stores every individual column divided into multiple pages, whose number per row group is estimated by Eq. (35). Next, we calculate the body size of Parquet using Eq. (36), by considering the metadata for each page (namely definition level and repetition level) and for every row group (namely the counter of rows per row group and the sync marker).
Finally, we calculate the footer size by approximating the size of the schema, sketched in Fig. 20, by a constant number of bytes per column. Moreover, Parquet also stores statistical information about columns in the footer section, for both row groups and data pages. Equation (37) combines all these values to calculate the overall size of the footer. Then, the total size of Parquet is obtained by adding the header, body and footer sections, as defined in Eq. (1).
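The Parquet estimation described above can be sketched as follows. This is an approximation written from the prose, not a reproduction of Eqs. (9), (18) and (35)–(37): all names and default constants are hypothetical and are not taken from Table 6, row groups and pages are assumed to be filled up to fixed target sizes, and statistics are assumed to cost a constant per column in each row group and per data page.

```python
import math
from dataclasses import dataclass


@dataclass
class ParquetParams:
    # All defaults are placeholder values chosen for illustration.
    header_size: int = 4            # fixed header
    row_group_size: int = 6000      # target bytes of data per row group
    page_size: int = 1000           # target bytes of data per page
    definition_level_size: int = 4  # per-page metadata
    repetition_level_size: int = 4  # per-page metadata
    row_counter_size: int = 8       # per-row-group metadata
    sync_marker_size: int = 16      # per-row-group metadata
    schema_per_column: int = 50     # constant schema size per column
    stats_size: int = 20            # statistics entry (column chunk or page)


def parquet_size(num_rows, col_sizes, p):
    """Total size: fixed header + body + footer."""
    data_size = num_rows * sum(col_sizes)
    row_groups = max(1, math.ceil(data_size / p.row_group_size))
    rows_per_group = math.ceil(num_rows / row_groups)

    # Each column of a row group is split into pages of bounded size.
    pages_per_group = sum(
        max(1, math.ceil(rows_per_group * c / p.page_size)) for c in col_sizes
    )
    total_pages = row_groups * pages_per_group

    page_metadata = total_pages * (p.definition_level_size
                                   + p.repetition_level_size)
    group_metadata = row_groups * (p.row_counter_size + p.sync_marker_size)
    body_size = data_size + page_metadata + group_metadata

    # Footer: schema plus statistics for row groups and data pages.
    column_chunks = row_groups * len(col_sizes)
    footer_size = (len(col_sizes) * p.schema_per_column
                   + p.stats_size * (column_chunks + total_pages))
    return p.header_size + body_size + footer_size


print(parquet_size(1000, [4, 8], ParquetParams()))
```

Because metadata is paid per page and per row group rather than per row, the overhead in this sketch depends on how the data is partitioned, which is the main structural difference from the two horizontal formats.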
Cite this article
Munir, R.F., Abelló, A., Romero, O. et al. A cost-based storage format selector for materialized results in big data frameworks. Distrib Parallel Databases 38, 335–364 (2020). https://doi.org/10.1007/s10619-019-07271-0
- Big data
- Data-intensive workflows
- Materialized results
- Storage format
- Cost model