Architecture of a distributed storage that combines file system, memory and computation in a single layer

Zou, Jia; Iyengar, Arun; Jermaine, Chris

doi:10.1007/s00778-020-00605-w

Regular Paper
Published: 26 February 2020

The VLDB Journal, Volume 29, pages 1049–1073 (2020)

Jia Zou¹,
Arun Iyengar² &
Chris Jermaine³

Abstract

Storage and memory systems for modern data analytics are heavily layered: shared persistent data, cached data, and non-shared execution data are managed in separate systems, such as a distributed file system like HDFS, an in-memory file system like Alluxio, and a computation framework like Spark. Such layering introduces significant performance and management costs. In this paper, we propose a single system called Pangea that manages all data, both intermediate and long-lived, together with its buffering/caching, page replacement, data placement optimization, and failure recovery, in one monolithic distributed storage system without any layering. We present a detailed performance evaluation of Pangea and show that its performance compares favorably with several widely used layered systems such as Spark.

Notes

For example, there are only six active locality sets in a single k-means iteration as illustrated in Fig. 1.
The inverse of the page’s reference distance can be seen as yet another reasonable estimate for \(\lambda \), as this replaces \(t_\mathrm{now} - t_\mathrm{ref}\) with the page’s last observed between-reference time as an estimate for the expected time interval between page references. We choose the time since last reference, however, as it requires only a single reference to be valid.
https://github.com/ssavvides/tpch-spark.
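The two estimates for \(\lambda \) compared in the second note can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes \(\lambda \) is a per-page reference rate (the note suggests this, but this excerpt does not define it), and the function names and the timestamp `t_prev_ref` are hypothetical.

```python
from typing import Optional


def rate_from_recency(t_now: float, t_ref: float) -> float:
    """Estimate lambda as 1 / (t_now - t_ref), the time since the last reference.

    Valid as soon as a page has been referenced once.
    """
    elapsed = t_now - t_ref
    if elapsed <= 0:
        raise ValueError("t_now must be later than t_ref")
    return 1.0 / elapsed


def rate_from_reference_distance(
    t_ref: float, t_prev_ref: Optional[float]
) -> Optional[float]:
    """Estimate lambda as 1 / (last observed between-reference time).

    Needs two references; returns None when only one has been seen.
    t_prev_ref is a hypothetical name for the reference before t_ref.
    """
    if t_prev_ref is None:
        return None
    gap = t_ref - t_prev_ref
    if gap <= 0:
        raise ValueError("t_ref must be later than t_prev_ref")
    return 1.0 / gap
```

For a page referenced at times 4 and 8 and inspected at time 10, the recency estimate is 1/2 and the reference-distance estimate is 1/4. For a page referenced only once, the second function has nothing to return, which is the reason the note gives for preferring the time since last reference.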

Acknowledgements

Funding was provided by the Defense Advanced Research Projects Agency (Grant No. FA8750-14-2-0270) and the Directorate for Computer and Information Science and Engineering (Grant No. 1409543).

Author information

Authors and Affiliations

School of Computing, Informatics, and Decision Systems, Computer Science and Engineering, Arizona State University, Tempe, USA
Jia Zou
IBM T. J. Watson Research Center, Yorktown Heights, USA
Arun Iyengar
Department of Computer Science, Rice University, Houston, USA
Chris Jermaine

Corresponding author

Correspondence to Jia Zou.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zou, J., Iyengar, A. & Jermaine, C. Architecture of a distributed storage that combines file system, memory and computation in a single layer. The VLDB Journal 29, 1049–1073 (2020). https://doi.org/10.1007/s00778-020-00605-w

Received: 15 August 2019
Accepted: 13 February 2020
Published: 26 February 2020
Issue Date: September 2020
DOI: https://doi.org/10.1007/s00778-020-00605-w
