Abstract
Storage and memory systems for modern data analytics are heavily layered: shared persistent data, cached data, and non-shared execution data are managed in separate systems, such as a distributed file system like HDFS, an in-memory file system like Alluxio, and a computation framework like Spark. Such layering introduces significant performance and management costs. In this paper, we propose a single system called Pangea that manages all data, both intermediate and long-lived, together with its buffering and caching, page replacement, data placement optimization, and failure recovery, in one monolithic distributed storage system without any layering. We present a detailed performance evaluation of Pangea and show that its performance compares favorably with several widely used layered systems such as Spark.
Notes
For example, there are only six active locality sets in a single k-means iteration as illustrated in Fig. 1.
The inverse of the page’s reference distance can be seen as yet another reasonable estimate for \(\lambda \), as this replaces \(t_\mathrm{now} - t_\mathrm{ref}\) with the page’s last observed between-reference time as an estimate for the expected time interval between page references. We choose the time since last reference, however, as it requires only a single reference to be valid.
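The two estimates contrasted in the note above can be sketched as follows. This is a minimal illustration, assuming (as the note implies) that \(\lambda\) is taken as the reciprocal of an elapsed time; the function names, the use of `None` for a missing earlier reference, and the time units are hypothetical and not taken from the paper.

```python
from typing import Optional


def rate_from_recency(t_now: float, t_ref: float) -> float:
    """Estimate lambda as 1 / (t_now - t_ref).

    Needs only the time of the single most recent reference.
    """
    elapsed = t_now - t_ref
    if elapsed <= 0:
        raise ValueError("t_now must be later than t_ref")
    return 1.0 / elapsed


def rate_from_reference_distance(
    t_ref: float, t_prev_ref: Optional[float]
) -> Optional[float]:
    """Estimate lambda as 1 / (last observed between-reference time).

    Returns None when the page has been referenced only once, since no
    between-reference interval has been observed yet.
    """
    if t_prev_ref is None:
        return None
    distance = t_ref - t_prev_ref
    if distance <= 0:
        raise ValueError("t_ref must be later than t_prev_ref")
    return 1.0 / distance


# A page last referenced at t = 6, examined at t = 10 (arbitrary units).
recency_estimate = rate_from_recency(t_now=10.0, t_ref=6.0)

# The same page with no earlier reference, then with one at t = 4.
single_reference = rate_from_reference_distance(t_ref=6.0, t_prev_ref=None)
two_references = rate_from_reference_distance(t_ref=6.0, t_prev_ref=4.0)
```

The sketch shows the trade-off the note describes: the recency-based estimate is defined after a single reference, whereas the reference-distance estimate is undefined until a second reference has been observed.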
Acknowledgements
Funding was provided by the Defense Advanced Research Projects Agency (Grant No. FA8750-14-2-0270) and the Directorate for Computer and Information Science and Engineering (Grant No. 1409543).
About this article
Cite this article
Zou, J., Iyengar, A. & Jermaine, C. Architecture of a distributed storage that combines file system, memory and computation in a single layer. The VLDB Journal 29, 1049–1073 (2020). https://doi.org/10.1007/s00778-020-00605-w
DOI: https://doi.org/10.1007/s00778-020-00605-w