Abstract
Cloud computing systems are widely used to deploy big data applications because of their high storage and computation capacity. The key storage component in a cloud computing environment is the distributed file system, which stores and processes the data produced by these applications. Users of such applications issue read requests more frequently than write requests, so most cloud-based applications demand optimal performance from the distributed file system, especially for read operations. Numerous caching and prefetching techniques have been proposed in the literature to enhance the performance of distributed file systems. However, these techniques typically adopt a synchronous approach, prefetching either application data or user data only when the user application starts executing, which may extend the read access time. Furthermore, they prefetch data based on either access frequency or reuse distance without considering access recency, which may lower the cache hit ratio. In this paper, we propose application-specific and user-specific data prefetching algorithms that prefetch data from the distributed file system and store it in the file system's multi-level caches, based on a combination of the access frequency and recency ranking of file blocks previously accessed by client application programs. Additionally, we divide the cache into two partitions, a user cache and an application cache, which store the prefetched data according to a popularity value calculated from user-level and application-level accesses. We also introduce a parallel read algorithm that reads data simultaneously from the multiple caches present in the distributed file system environment.
The simulation results demonstrate that the proposed algorithms improve the distributed file system's performance by a minimum of 8 percent and a maximum of 92 percent in terms of average read access time when compared with different existing approaches.
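The ideas summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the equal weighting of frequency rank and recency rank, the dictionary-based caches, and the names popularity_ranking, prefetch_plan, parallel_read and read_from_dfs are all assumptions made for illustration. The sketch ranks previously accessed blocks by combining their frequency rank and recency rank, selects the top blocks for a user partition and an application partition, and reads several blocks concurrently from multiple caches with a fallback to the file system.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def popularity_ranking(access_log, weight=0.5):
    """Order blocks from most to least popular.

    access_log: block ids in access order, oldest first.
    weight: hypothetical knob; share given to the frequency rank,
            the remainder goes to the recency rank.
    """
    freq = Counter(access_log)
    last_seen = {block: i for i, block in enumerate(access_log)}
    blocks = list(freq)
    # Rank 0 is best: most frequently / most recently accessed.
    # Equal counts keep first-seen order because sorted() is stable.
    by_freq = sorted(blocks, key=lambda b: -freq[b])
    by_recency = sorted(blocks, key=lambda b: -last_seen[b])
    freq_rank = {b: r for r, b in enumerate(by_freq)}
    rec_rank = {b: r for r, b in enumerate(by_recency)}
    score = {b: weight * freq_rank[b] + (1 - weight) * rec_rank[b]
             for b in blocks}
    # A lower combined score means a more popular block.
    return sorted(blocks, key=lambda b: (score[b], b))


def prefetch_plan(user_log, app_log, user_slots, app_slots):
    """Pick the top-ranked blocks for each cache partition."""
    return {
        "user": popularity_ranking(user_log)[:user_slots],
        "application": popularity_ranking(app_log)[:app_slots],
    }


def parallel_read(block_ids, caches, read_from_dfs, workers=4):
    """Read blocks concurrently, checking each cache in order.

    caches: list of dict-like caches (assumed structure).
    read_from_dfs: hypothetical fallback called on a miss in every cache.
    """
    def read_one(block):
        for cache in caches:
            if block in cache:
                return cache[block]
        return read_from_dfs(block)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(block_ids, pool.map(read_one, block_ids)))
```

For example, popularity_ranking(["a", "b", "a", "c", "a", "b", "d"]) places "a" and "b" ahead of "d" and "c": "d" was accessed only once but most recently, so it outranks the equally infrequent "c". A real system would tune the weighting and account for cache capacity in bytes rather than block counts.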
Funding
The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.
Contributions
All authors contributed equally to this work.
Ethics declarations
Competing interests
The authors have no relevant financial or non-financial interests to disclose.
Appendix A
The variables used in the proposed algorithms are defined in Table 3.
About this article
Cite this article
Nalajala, A., Ragunathan, T., Naha, R. et al. Application and user-specific data prefetching and parallel read algorithms for distributed file systems. Cluster Comput (2023). https://doi.org/10.1007/s10586-023-04160-1
DOI: https://doi.org/10.1007/s10586-023-04160-1