Large-Scale Data Management System Using Data De-duplication System

  • S. Abirami
  • Rashmi Vikraman
  • S. Murugappan
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 379)


Data de-duplication is the process of finding duplicate data and eliminating it from the storage environment. De-duplication can be performed at several levels: at the file level, where the entire file is considered as a whole for duplicate detection; at the chunk level, where the file is split into small units called chunks and duplicates are detected among those chunks; and at the byte level, where the comparison is carried out byte by byte. The fingerprint of each chunk is the main parameter for duplicate detection, and these fingerprints are stored in a chunk index. As the chunk index grows, it must be placed on disk, and searching for a fingerprint in the on-disk index consumes considerable time, leading to what is known as the chunk lookup disk bottleneck problem. This paper mitigates that problem by placing a Bloom filter in the cache as a probabilistic summary of all the fingerprints in the on-disk chunk index. The approach is evaluated on backup data sets obtained from university labs, and performance is measured in terms of the data de-duplication ratio.
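The scheme the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the fixed chunk size, the Bloom filter parameters (`m`, `k`), and the use of an in-memory dictionary to stand in for the on-disk chunk index are all assumptions made for the example. The key property it demonstrates is that a Bloom filter never gives false negatives, so a "not present" answer lets the system skip the slow index lookup entirely.

```python
import hashlib

CHUNK_SIZE = 8  # bytes per fixed-size chunk (tiny, purely for demonstration)

class BloomFilter:
    """Probabilistic membership summary: no false negatives,
    tunable false-positive rate via m (bits) and k (hash functions)."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions by salting SHA-1 with the hash index.
        for i in range(self.k):
            h = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def dedup(data: bytes):
    """Chunk the input, fingerprint each chunk, and store only unique
    chunks. Returns (bytes actually stored, de-duplication ratio)."""
    bloom = BloomFilter()
    chunk_index = {}  # stands in for the fingerprint index kept on disk
    stored = 0
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    for chunk in chunks:
        fp = hashlib.sha1(chunk).digest()  # chunk fingerprint
        # Consult the (slow) chunk index only when the cached Bloom filter
        # answers "maybe"; a "no" is guaranteed correct, so the disk
        # lookup is avoided -- the point of the cached summary.
        if not (bloom.might_contain(fp) and fp in chunk_index):
            chunk_index[fp] = chunk
            bloom.add(fp)
            stored += len(chunk)
    ratio = len(data) / stored if stored else 1.0
    return stored, ratio

# Four copies of one chunk plus two of another: only 2 unique chunks stored.
stored, ratio = dedup(b"ABCDEFGH" * 4 + b"12345678" * 2)
print(stored, ratio)  # -> 16 3.0
```

Here the de-duplication ratio is the logical size divided by the physically stored size; the repeated 8-byte chunks compress 48 bytes of input down to 16 bytes of unique chunk data.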


Keywords: Data de-duplication · Storage · Compression



Copyright information

© Springer India 2016

Authors and Affiliations

  1. Department of Information Science and Technology, College of Engineering, Anna University, Chennai, India
  2. School of Computer Science, Tamilnadu Open University, Chennai, India
