Emerging Research Trends in Data Deduplication: A Bibliometric Analysis from 2010 to 2023

  • Review article

Archives of Computational Methods in Engineering

Abstract

Both industry and academia need to use data storage efficiently, because large volumes of duplicate data in the cloud waste storage space. This creates a need to explore and propose algorithms that make cloud storage more efficient. Data deduplication is a technique that addresses this need by removing duplicate data. It is therefore important to study the state-of-the-art deduplication techniques in the literature that address the storage problem. This paper examines research on data deduplication from 2010 to 2023 through a bibliometric analysis based on a sample of 461 documents taken from the Scopus database. The review is carried out with the Biblioshiny application, which is included in the Bibliometrix package for the R language. The analysis covers annual scientific production, total citations per year, author and document citations, common key terms, the most relevant authors and sources, and trending topics in the field. The results are organized to help future researchers by pointing out directions to explore. The findings indicate that, as research advances, experts are paying greater attention to the consequences of duplicate data in the cloud in the context of the data deduplication process, and that research goals are becoming more focused.
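
To make the indicators named in the abstract more concrete, the following minimal Python sketch computes three of them (documents per publication year, mean citations per document by year, and keyword frequencies) from a CSV export. It is an illustration only and not the authors' Biblioshiny workflow: the column names `Year`, `Cited by`, and `Author Keywords` are assumed to match a Scopus-style export, the function name `summarize` is hypothetical, and the three sample rows are invented for demonstration.

```python
import csv
import io
from collections import Counter, defaultdict


def summarize(csv_text):
    """Return (documents per year, mean citations per document by year, keyword counts).

    Assumes Scopus-style column names: "Year", "Cited by", "Author Keywords".
    """
    docs_per_year = Counter()
    cites_per_year = defaultdict(int)
    keywords = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        year = int(row["Year"])
        docs_per_year[year] += 1
        # An empty "Cited by" cell is treated as zero citations.
        cites_per_year[year] += int(row["Cited by"] or 0)
        # Keywords are assumed to be separated by semicolons.
        for kw in row["Author Keywords"].split(";"):
            kw = kw.strip().lower()
            if kw:
                keywords[kw] += 1
    mean_cites = {y: cites_per_year[y] / n for y, n in docs_per_year.items()}
    return dict(docs_per_year), mean_cites, keywords


# Hypothetical sample rows, not taken from the paper's dataset.
SAMPLE = """Year,Cited by,Author Keywords
2019,10,Data deduplication; Cloud storage
2019,4,Cloud storage; Encryption
2021,,Data deduplication
"""

production, mean_cites, keywords = summarize(SAMPLE)
print(production)
print(mean_cites)
print(keywords.most_common(2))
```

Keywords are lower-cased before counting so that spelling variants such as "Cloud storage" and "cloud storage" are tallied together; a real analysis would likely need further normalization (synonyms, plurals) that this sketch omits.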

Data Availability

Data are available from the corresponding author upon reasonable request.

Abbreviations

PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses

CSV: Comma-Separated Values

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Contributions

All authors contributed sufficiently to the article to be recognized as co-authors. Conceptualization, methodology, and implementation: AG; draft preparation: CP; supervision, writing, review, and editing: PS; writing, review, and editing: NM and VM.

Corresponding author

Correspondence to Nitin Mittal.

Ethics declarations

Conflict of interest

The authors affirm that they have no known financial or interpersonal conflicts that could have appeared to influence the research presented in this study.

Ethical Approval

All procedures followed were in accordance with the ethical standards of the responsible committee on human experimentation (institutional and national).

Human and Animal Rights

This article does not contain any studies with human or animal subjects performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Goel, A., Prabha, C., Sharma, P. et al. Emerging Research Trends in Data Deduplication: A Bibliometric Analysis from 2010 to 2023. Arch Computat Methods Eng (2024). https://doi.org/10.1007/s11831-024-10074-x

  • DOI: https://doi.org/10.1007/s11831-024-10074-x
