Architecture of a distributed storage that combines file system, memory and computation in a single layer

Abstract

Storage and memory systems for modern data analytics are heavily layered, managing shared persistent data, cached data, and non-shared execution data in separate systems such as a distributed file system like HDFS, an in-memory file system like Alluxio, and a computation framework like Spark. Such layering introduces significant performance and management costs. In this paper, we propose a single system called Pangea that manages all data, both intermediate and long-lived, together with its buffering and caching, page replacement, data placement optimization, and failure recovery, in one monolithic distributed storage system without any layering. We present a detailed performance evaluation of Pangea and show that its performance compares favorably with several widely used layered systems such as Spark.

Notes

  1. For example, there are only six active locality sets in a single k-means iteration as illustrated in Fig. 1.

  2. The inverse of the page’s reference distance can be seen as yet another reasonable estimate for \(\lambda \), as this replaces \(t_\mathrm{now} - t_\mathrm{ref}\) with the page’s last observed between-reference time as an estimate for the expected time interval between page references. We choose the time since last reference, however, as it requires only a single reference to be valid.

  3. https://github.com/ssavvides/tpch-spark.
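
Note 2 compares two ways of estimating \(\lambda\). The following is a minimal Python sketch of both, assuming \(\lambda\) denotes a per-page reference rate (this excerpt does not define it). The function names and the use of plain floating-point timestamps are hypothetical and are not taken from Pangea.

```python
from typing import Optional


def rate_from_recency(t_now: float, t_ref: float) -> float:
    """Estimate lambda as 1 / (t_now - t_ref), the inverse of the time
    since the page's last reference. Valid after a single reference."""
    elapsed = t_now - t_ref
    if elapsed <= 0:
        raise ValueError("t_now must be later than t_ref")
    return 1.0 / elapsed


def rate_from_reference_distance(
    t_ref: float, t_prev_ref: Optional[float]
) -> Optional[float]:
    """Alternative estimate: the inverse of the last observed
    between-reference time. Returns None when the page has been
    referenced only once, since no distance is available yet."""
    if t_prev_ref is None:
        return None
    distance = t_ref - t_prev_ref
    if distance <= 0:
        raise ValueError("t_ref must be later than t_prev_ref")
    return 1.0 / distance
```

The second function returns nothing useful for a page that has been referenced only once, which is the limitation the note gives as the reason for choosing the time since last reference.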

Acknowledgements

Funding was provided by the Defense Advanced Research Projects Agency (Grant No. FA8750-14-2-0270) and the Directorate for Computer and Information Science and Engineering (Grant No. 1409543).

Author information

Corresponding author

Correspondence to Jia Zou.

About this article

Cite this article

Zou, J., Iyengar, A. & Jermaine, C. Architecture of a distributed storage that combines file system, memory and computation in a single layer. The VLDB Journal 29, 1049–1073 (2020). https://doi.org/10.1007/s00778-020-00605-w

  • DOI: https://doi.org/10.1007/s00778-020-00605-w

Keywords

  • Distributed system
  • Monolithic storage
  • Big Data analytics
  • Heterogeneous replication