DDFlasks: Deduplicated Very Large Scale Data Store

  • Francisco Maia
  • João Paulo
  • Fábio Coelho
  • Francisco Neves
  • José Pereira
  • Rui Oliveira
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10320)


With the increasing number of connected devices, it becomes essential to find novel data management solutions that can leverage their computational and storage capabilities. However, developing very large scale data management systems requires tackling a number of interesting distributed systems challenges, namely continuous failures and high levels of node churn. In this context, epidemic-based protocols have proved suitable and effective, and have been successfully used to build DataFlasks, an epidemic data store for massive scale systems. Ensuring resiliency in this data store comes at a significant cost in storage resources and network bandwidth consumption. Deduplication has proven to be an efficient technique to reduce both costs, but applying it to a large-scale distributed storage system is not a trivial task. In fact, achieving significant space savings without compromising the resiliency and decentralized design of these storage systems is a relevant research challenge.

In this paper, we extend DataFlasks with deduplication to design DDFlasks. The system is evaluated in a real-world scenario using Wikipedia snapshots, and the results are twofold: deduplication decreases storage consumption by up to 63% and network bandwidth consumption by up to 20%, while maintaining a fully decentralized and resilient design.



The research leading to these results was part-funded by (1) Project TEC4Growth - Pervasive Intelligence, Enhancers and Proofs of Concept with Industrial Impact/NORTE-01-0145-FEDER-000020, financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF); (2) the ERDF through the Operational Programme for Competitiveness and Internationalisation - COMPETE 2020 Programme within project POCI-01-0145-FEDER-006961, and by National Funds through the FCT (Portuguese Foundation for Science and Technology) as part of project UID/EEA/50014/2013; and (3) the European Union’s Horizon 2020 - The EU Framework Programme for Research and Innovation 2014–2020, under grant agreement No. 732051.



Copyright information

© IFIP International Federation for Information Processing 2017

Authors and Affiliations

  • Francisco Maia (1)
  • João Paulo (1)
  • Fábio Coelho (1)
  • Francisco Neves (1)
  • José Pereira (1)
  • Rui Oliveira (1)

  1. HASLab, INESC TEC, University of Minho, Braga, Portugal
