Knowledge and Information Systems, Volume 60, Issue 3, pp 1165–1245

The big data system, components, tools, and technologies: a survey

  • T. Ramalingeswara Rao
  • Pabitra Mitra
  • Ravindara Bhatt
  • A. Goswami
Survey Paper

Abstract

Traditional databases cannot handle unstructured data or high volumes of real-time data. Diverse, unstructured datasets give rise to big data, and it is laborious to store, manage, process, analyze, visualize, and extract useful insights from these datasets using traditional database approaches. Refining large heterogeneous datasets, however, involves many technical aspects. This paper aims to present a generalized view of a complete big data system, including its several stages and the key components of each stage in processing big data. In particular, we compare and contrast various distributed file systems and MapReduce-supported NoSQL databases with respect to certain parameters of the data management process. Further, we present distinct distributed/cloud-based machine learning (ML) tools that play a key role in designing, developing, and deploying data models. The paper investigates case studies on distributed ML tools such as Mahout, Spark MLlib, and FlinkML. We then classify analytics based on the type of data, domain, and application. We distinguish various visualization tools with respect to three parameters: functionality, analysis capabilities, and supported development environment. Furthermore, we systematically investigate big data tools and technologies (Hadoop 3.0, Spark 2.3), including distributed/cloud-based stream processing tools, in a comparative approach. Moreover, we discuss the functionalities of several SQL query tools on Hadoop based on 10 parameters. Finally, we present some critical points relevant to research directions and opportunities according to the current trend of big data. Investigating infrastructure tools for big data in light of recent developments provides a better understanding of how different tools and technologies apply to real-life applications.

Keywords

Big data; Components of big data system; Distributed file systems; NoSQL databases; Visualization; SQL Query tools; Data analytics

  240. 240.
  241. 241.
  242. 242.
  243. 243.
    De Morales GF, Bifet A (2015) Samoa: scalable advanced massive online analysis. J Mach Learn Res 16(1):149–153Google Scholar
  244. 244.
  245. 245.
  246. 246.
    Akidau T, Balikov A, Bekiroğlu K, Chernyak S, Haberman J, Lax R, McVeety S, Mills D, Nordstrom P, Whittle S (2013) Millwheel: fault-tolerant stream processing at internet scale. Proc VLDB Endow 6(11):1033–1044Google Scholar
  247. 247.
    Kulkarni S, Bhagat N, Fu M, Kedigehalli V, Kellogg C, Mittal S, Patel JM, Ramasamy K, Taneja S (2015) Twitter heron: stream processing at scale. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, pp 239–250Google Scholar
  248. 248.
    Abadi D, Carney D, Cetintemel U, Cherniack M, Convey C, Erwin C, Galvez E, Hatoun M, Maskey A, Rasin A et al (2003) Aurora: a data stream management system. In: Proceedings of the 2003 ACM SIGMOD international conference on management of data, pp 666–666Google Scholar
  249. 249.
  250. 250.
  251. 251.
  252. 252.
    Fu M, Agrawal A, Floratou A, Graham B, Jorgensen A, Li M, Lu N, Ramasamy K, Rao S, Wang C (2017) Twitter heron: towards extensible streaming engines. In: Data engineering (ICDE), 2017 IEEE 33rd international conference on, pp 1165–1172Google Scholar
  253. 253.
  254. 254.
  255. 255.
  256. 256.
  257. 257.
  258. 258.
    Shukla A, Chaturvedi S, Simmhan Y (2017) Riotbench: a real-time iot benchmark for distributed stream processing platforms. arXiv preprint arXiv:1701.08530
  259. 259.
    Dreissig F, Pollner N (2017) A data center infrastructure monitoring platform based on storm and trident. Datenbanksysteme für Business, Technologie und Web (BTW 2017)-WorkshopbandGoogle Scholar
  260. 260.
    Saha B, Shah H, Seth S, Vijayaraghavan G, Murthy A, Curino C (2015) Apache tez: a unifying framework for modeling and building data processing applications. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, pp 1357–1369Google Scholar
  261. 261.
    Tpc-h is a decision support benchmark. http://www.tpc.org/
  262. 262.
  263. 263.
  264. 264.
  265. 265.
    Sebastio S, Ghosh R, Mukherjee T (2018) An availability analysis approach for deployment configurations of containers. IEEE Trans Serv ComputGoogle Scholar
  266. 266.
    Medel V, Rana O, Bañares JÁ, Arronategui Unai (2016) Modelling performance and resource management in kubernetes. In: Utility and cloud computing (UCC), 2016 IEEE/ACM 9th international conference on, pp 257–262Google Scholar
  267. 267.
    Hindman B, Konwinski A, Zaharia M, Ghodsi A, Joseph AD, Katz RH, Shenker S, Stoica I (2011) Mesos: a platform for fine-grained resource sharing in the data center. In: NSDI, vol 11, pp 295–308Google Scholar
  268. 268.
    Amazon web services. https://aws.amazon.com/docker/
  269. 269.
    Kreps J, Narkhede N, Rao J et al (2011) Kafka: a distributed messaging system for log processing. In: Proceedings of the NetDB, pp 1–7Google Scholar
  270. 270.
  271. 271.
  272. 272.
  273. 273.
    Lampesberger H (2016) Technologies for web and cloud service interaction: a survey. Serv Oriented Comput Appl 10(2):71–110Google Scholar
  274. 274.
    Dobbelaere P, Esmaili KS (2017) Kafka versus RabbitMQ. arXiv preprint arXiv:1709.00333
  275. 275.
    Sangat P, Indrawan-Santiago M, Taniar D (2018) Sensor data management in the cloud: data storage, data ingestion, and data retrieval. Concurr Comput: Pract Exp 30(1)Google Scholar
  276. 276.
    Hoffman S (2013) Apache flume: distributed log collection for hadoop. Packt Publishing LtdGoogle Scholar
  277. 277.
    Ting K, Cecho JJ (2013) Apache Sqoop Cookbook. O’Reilly Media, IncGoogle Scholar
  278. 278.
    Rabkin A, Katz RH (2010) Chukwa: a system for reliable large-scale log collection. LISA 10:1–15Google Scholar
  279. 279.
  280. 280.
    Low Y, Gonzalez J, Kyrola A, Bickson D, Guestrin C, Hellerstein JM (2010) Graphlab: a new framework for parallel machine learning. arxiv preprint. arXiv preprint arXiv:1006.4990
  281. 281.
    Aver C (2011) Giraph: large-scale graph processing infrastructure on hadoop. In: Proceedings of the Hadoop summit. Santa Clara 11(3), 5–9Google Scholar
  282. 282.
    Gonzalez JE, Low Y, Haijie G, Bickson D, Guestrin C (2012) Powergraph: distributed graph-parallel computation on natural graphs. OSDI 12(1):2–2Google Scholar
  283. 283.
    Salihoglu S, Widom J (2013) Gps: a graph processing system. In: Proceedings of the 25th international conference on scientific and statistical database management 22, pp 1–12Google Scholar
  284. 284.
    Gonzalez JE, Xin RS, Dave A, Crankshaw D, Franklin MJ, Stoica I (2014) Graphx: graph processing in a distributed dataflow framework. OSDI 14:599–613Google Scholar
  285. 285.
    Xin RS, Crankshaw D, Dave A, Gonzalez JE, Franklin MJ, Stoica I (2014) Graphx: unifying data-parallel and graph-parallel analytics. arXiv preprint arXiv:1402.2394
  286. 286.
  287. 287.
    Junghanns M, Petermann A, Gómez K, Rahm E (2015) Gradoop: scalable graph data management and analytics with hadoop. arXiv preprint arXiv:1506.00548
  288. 288.
    Hunt P, Konar M, Junqueira FP, Reed B (2010) Zookeeper: Wait-free coordination for internet-scale systems. In: USENIX annual technical conference 8(9)Google Scholar
  289. 289.
  290. 290.
  291. 291.
    Hu W, Qu Y (2008) Falcon-AO: a practical ontology matching system. Web Semant: Sci Serv Agents World Wide Web 6(3):237–239Google Scholar
  292. 292.
    Apache nifi project. https://nifi.apache.org/
  293. 293.
    Islam M, Huang AK, Battisha M, Chiang M, Srinivasan S, Peters C, Neumann A, Abdelnur A (2012) Oozie: towards a scalable workflow management system for hadoop. In: Proceedings of the 1st ACM SIGMOD workshop on scalable workflow execution engines and technologies 4:1–4:10Google Scholar

Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2018

Authors and Affiliations

  • T. Ramalingeswara Rao (1)
  • Pabitra Mitra (2)
  • Ravindara Bhatt (3)
  • A. Goswami (1)

  1. Theoretical Computer Science Group, Department of Mathematics, Indian Institute of Technology Kharagpur, Kharagpur, India
  2. Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India
  3. Department of Computer Science and Engineering, Jaypee University of Information Technology, Waknaghat, India
