Encyclopedia of Big Data Technologies

2019 Edition | Editors: Sherif Sakr, Albert Y. Zomaya

Scalable Architectures for Big Data Analysis

  • Peng Sun
  • Yonggang Wen
Reference work entry
DOI: https://doi.org/10.1007/978-3-319-77525-8_281

Overview

The era of big data is upon us. However, traditional data management and analysis systems, which are mainly based on relational database management systems (RDBMSs), may not be able to handle the ever-growing data volume. It is therefore important to design scalable system architectures that process big data efficiently and exploit its value. This chapter discusses various big data platforms based on horizontal and vertical scaling, focusing on their architectural principles for big data analysis applications such as machine learning and graph processing. It may help users select the right system architecture or platform for their big data applications.

Introduction

This is an era of big data, evidenced by the sheer volume of data from a variety of sources and its growing rate of generation. According to a report from the International Data Corporation (IDC), the global data volume will grow by a factor of 300, from 130 exabytes (1 exabyte = 10^6 terabytes) to 40,000...

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore