Application and user-specific data prefetching and parallel read algorithms for distributed file systems

Cluster Computing

Abstract

Cloud computing systems are widely used to deploy big data applications because of their high storage and computation capacity. The key storage component in a cloud computing environment is the distributed file system, which stores and processes the data these applications produce. Users of such applications issue read requests more frequently than write requests, so most cloud-based applications demand optimal performance from the distributed file system, especially for read operations. Numerous caching and prefetching techniques have been proposed to enhance distributed file system performance. However, these techniques typically adopt a synchronous approach, prefetching either application data or user data when the user application starts executing, which may extend read access time. Furthermore, they prefetch data based on either access frequency or reuse distance without considering access recency, which may lower the cache hit ratio. In this paper, we propose application-specific and user-specific data prefetching algorithms that prefetch data from the distributed file system and store it in the system's multi-level caches, based on a combination of the access frequency and recency ranking of file blocks previously accessed by client application programs. Additionally, we divide the cache into two partitions, a user cache and an application cache, which store the prefetched data according to a popularity value calculated from user-level and application-level accesses. We also introduce a parallel read algorithm that reads data simultaneously from the multiple caches present in the distributed file system environment. Simulation results demonstrate that the proposed algorithms improve distributed file system performance by a minimum of 8 percent and a maximum of 92 percent in terms of average read access time when compared with different existing approaches.
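
The abstract does not give the actual formulas, so the following Python sketch only illustrates the general idea of combining a frequency rank with a recency rank to pick blocks for prefetching, and of reading from a user cache and an application cache concurrently. The function names, the weight `alpha`, the normalised rank score, and the dictionary-based caches are all assumptions made for illustration; they are not the authors' algorithms.

```python
from concurrent.futures import ThreadPoolExecutor


def popularity(access_log, alpha=0.5):
    """Score each block from its frequency rank and recency rank.

    access_log: block ids in access order, oldest first (hypothetical input).
    alpha: assumed weight on frequency; (1 - alpha) goes to recency.
    Returns {block: score}; a higher score means a more popular block.
    """
    freq, last_seen = {}, {}
    for t, block in enumerate(access_log):
        freq[block] = freq.get(block, 0) + 1
        last_seen[block] = t
    n = len(freq)
    # Position 0 is the most frequent / most recent block; ties break on block id.
    by_freq = sorted(freq, key=lambda b: (-freq[b], b))
    by_recency = sorted(last_seen, key=lambda b: (-last_seen[b], b))
    freq_pos = {b: i for i, b in enumerate(by_freq)}
    rec_pos = {b: i for i, b in enumerate(by_recency)}
    return {
        b: alpha * (n - freq_pos[b]) / n + (1 - alpha) * (n - rec_pos[b]) / n
        for b in freq
    }


def select_prefetch(access_log, capacity, alpha=0.5):
    """Return up to `capacity` block ids with the highest popularity score."""
    scores = popularity(access_log, alpha)
    return sorted(scores, key=lambda b: (-scores[b], b))[:capacity]


def parallel_read(block_ids, caches, backing_store):
    """Read blocks concurrently from several caches, falling back to the store.

    caches: {cache_name: {block: data}}, e.g. "user" and "application".
    backing_store: {block: data}, standing in for the distributed file system.
    Returns {block: (data, source_name)}.
    """
    def read_one(block):
        for name, cache in caches.items():
            if block in cache:
                return cache[block], name
        return backing_store[block], "dfs"

    with ThreadPoolExecutor(max_workers=max(1, len(caches))) as pool:
        results = list(pool.map(read_one, block_ids))
    return dict(zip(block_ids, results))


if __name__ == "__main__":
    log = ["a", "b", "a", "c", "a", "b"]
    print(select_prefetch(log, capacity=2, alpha=1.0))  # frequency only
    print(select_prefetch(log, capacity=2, alpha=0.0))  # recency only
```

With `alpha=1.0` the selection depends on frequency alone, and with `alpha=0.0` on recency alone; intermediate values blend the two rankings. A real system would replace the in-memory dictionaries with the cache tiers and the file system client it actually uses.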

Data availability

https://github.com/NalajalaA/code.git

Funding

The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Author information

Contributions

All authors contributed equally to this work.

Corresponding author

Correspondence to Anusha Nalajala.

Ethics declarations

Competing interests

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A

The variables used in the proposed algorithms are defined in Table 3.

Table 3 Variable Definitions

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Nalajala, A., Ragunathan, T., Naha, R. et al. Application and user-specific data prefetching and parallel read algorithms for distributed file systems. Cluster Comput (2023). https://doi.org/10.1007/s10586-023-04160-1

  • DOI: https://doi.org/10.1007/s10586-023-04160-1
