An Efficient and Performance-Aware Big Data Storage System

  • Yang Li
  • Li Guo
  • Yike Guo
Part of the Communications in Computer and Information Science book series (CCIS, volume 367)


The rapid growth of the Internet and of data volumes has created an increasing demand for large-capacity storage solutions. Although Cloud storage has yielded new ways of storing, accessing and managing data, there is still a need for an inexpensive, effective and efficient storage solution that is especially suited to big data management and analysis. In this paper, we extend our previous work and present an in-depth analysis of the key features of future big data storage services for both unstructured and semi-structured data, and discuss how such services should be constructed and deployed. We also explain how different technologies can be combined to provide a single, highly scalable, efficient and performance-aware big data storage system. We focus in particular on data de-duplication for enterprises and private organisations. This research is particularly valuable for inexperienced solution providers such as universities and research organisations, and will allow them to set up their own big data storage services swiftly.


Keywords: Big Data Storage; Cloud Computing; Cloud Storage; Amazon S3; CACSS



Copyright information

© Springer International Publishing Switzerland 2013

Authors and Affiliations

  • Yang Li (1)
  • Li Guo (1)
  • Yike Guo (1)
  1. Department of Computing, Imperial College London, U.K.
