Application and user-specific data prefetching and parallel read algorithms for distributed file systems

Nalajala, Anusha; Ragunathan, T.; Naha, Ranesh; Battula, Sudheer Kumar

doi:10.1007/s10586-023-04160-1

Published: 28 October 2023

Cluster Computing

Abstract

Cloud computing systems are widely used to deploy big data-based applications because of their high storage and computation capacity. The key storage component in a cloud computing environment is the distributed file system, which can effectively store and process the data produced by big data-based applications. Users of such applications issue read requests more frequently than write requests, so most of these cloud-based applications demand optimal performance from the distributed file system, especially for read operations. Numerous caching and prefetching techniques have been proposed in the literature to enhance the performance of distributed file systems. However, these techniques typically adopt a synchronous approach, focusing on either application data prefetching or user data prefetching when the user application starts executing, which may result in an extended read access time. Furthermore, the data is prefetched based on either access frequency or reuse distance, without considering the access recency of the data, which may result in a lower cache hit ratio. In this paper, we propose application-specific and user-specific data prefetching algorithms that prefetch data from the distributed file system and store it in the multi-level caches present in the distributed file system, based on a combination of the access frequency and recency ranking of file blocks previously accessed by client application programs. Additionally, we divide the cache into two partitions, namely the user and application caches, to store the prefetched data according to a popularity value calculated from user-level and application-level accesses. We also introduce a parallel read algorithm that reads data simultaneously from the multiple caches present in the distributed file system environment. The simulation results demonstrate that the proposed algorithms improve the distributed file system's performance by a minimum of 8 percent and a maximum of 92 percent in terms of average read access time when compared with different existing approaches.
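
The ideas summarized in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the abstract does not give the popularity formula, so the equal-weight sum of a frequency rank and a recency rank used here is an assumption, as are the function names `rank_blocks`, `fill_partitions`, and `parallel_read` and the use of plain dictionaries as caches.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def rank_blocks(access_log):
    """Order block ids from most to least popular.

    access_log lists block ids, oldest access first. Popularity is
    ASSUMED to be the sum of a frequency rank and a recency rank
    (rank 1 is best); the paper's actual formula may differ.
    """
    freq = Counter(access_log)
    last_seen = {block: i for i, block in enumerate(access_log)}
    by_freq = sorted(freq, key=lambda b: (-freq[b], -last_seen[b]))
    by_recency = sorted(freq, key=lambda b: -last_seen[b])
    freq_rank = {b: r for r, b in enumerate(by_freq, start=1)}
    rec_rank = {b: r for r, b in enumerate(by_recency, start=1)}
    # Lower combined rank means more popular; ties go to the more recent block.
    return sorted(freq, key=lambda b: (freq_rank[b] + rec_rank[b], rec_rank[b]))


def fill_partitions(user_log, app_log, user_capacity, app_capacity):
    """Choose the blocks to prefetch into each cache partition (hypothetical helper)."""
    return {
        "user": rank_blocks(user_log)[:user_capacity],
        "application": rank_blocks(app_log)[:app_capacity],
    }


def parallel_read(block_ids, caches, backing_store):
    """Read several blocks concurrently, trying each cache before the backing store."""

    def read_one(block):
        for cache in caches:
            if block in cache:
                return cache[block]
        return backing_store[block]  # cache miss: fall back to the file system

    with ThreadPoolExecutor(max_workers=max(1, len(caches))) as pool:
        # pool.map returns results in the same order as block_ids.
        return list(pool.map(read_one, block_ids))
```

For example, with the access log `["a", "b", "a", "c", "b", "a", "d"]`, block `a` ranks first because it is both frequent and recent, and block `d` ranks ahead of `b` because its recency outweighs its lower frequency under the assumed equal weighting. A frequency-only policy would place `b` ahead of `d`, which is the kind of difference the abstract attributes to considering recency.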

Data availability

https://github.com/NalajalaA/code.git

Funding

The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Author information

Anusha Nalajala, T. Ragunathan, Ranesh Naha and Sudheer Kumar Battula have contributed equally to this work.

Authors and Affiliations

Department of CSE, SRM University-AP, Neerukonda, Andhra Pradesh, 522502, India
Anusha Nalajala
Faculty of Engineering and Technology, Sri Ramachandra Institute of Higher Education and Research, Chennai, Tamil Nadu, India
T. Ragunathan
Centre for Smart Analytics, Federation University Australia, Gippsland Campus, Churchill, VIC 3841, Australia
Ranesh Naha
Centre for Smart Analytics, Federation University Australia, Gippsland Campus, Churchill, VIC 3841, Australia
Sudheer Kumar Battula

Contributions

All authors contributed equally to this work.

Corresponding author

Correspondence to Anusha Nalajala.

Ethics declarations

Competing interests

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A

The variable definitions used in the proposed algorithms are listed in Table 3.

Table 3 Variable Definitions

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Nalajala, A., Ragunathan, T., Naha, R. et al. Application and user-specific data prefetching and parallel read algorithms for distributed file systems. Cluster Comput (2023). https://doi.org/10.1007/s10586-023-04160-1

Received: 20 November 2022
Revised: 24 September 2023
Accepted: 25 September 2023
Published: 28 October 2023
DOI: https://doi.org/10.1007/s10586-023-04160-1
