Big Data, pp 33–49

Big Data Storage

Part of the SpringerBriefs in Computer Science book series (BRIEFSCOMPUTER)

Abstract

In this chapter, we focus on the storage of big data. We review important issues including massive storage systems, distributed storage systems, and big data storage mechanisms. On the one hand, the storage infrastructure needs to provide an information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for querying and analyzing large amounts of data. Such a storage infrastructure generally consists of hardware infrastructure and storage mechanisms.

Keywords

  • Server Failure
  • MapReduce Framework
  • Distribute File System
  • Data Storage System
  • NoSQL Database

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© 2014 The Author(s)

About this chapter

Cite this chapter

Chen, M., Mao, S., Zhang, Y., Leung, V.C.M. (2014). Big Data Storage. In: Big Data. SpringerBriefs in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-319-06245-7_4

  • DOI: https://doi.org/10.1007/978-3-319-06245-7_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-06244-0

  • Online ISBN: 978-3-319-06245-7

  • eBook Packages: Computer Science, Computer Science (R0)