Setting Up a Big Data Project: Challenges, Opportunities, Technologies and Optimization

  • Roberto V. Zicari
  • Marten Rosselli
  • Todor Ivanov
  • Nikolaos Korfiatis
  • Karsten Tolle
  • Raik Niemann
  • Christoph Reichenbach
Chapter
Part of the Studies in Big Data book series (SBD, volume 18)

Abstract

In the first part of this chapter we illustrate how a big data project can be set up and optimized. We explain the general value of big data analytics for the enterprise and how value can be derived by analyzing big data. We then introduce the characteristics of big data projects and describe how such projects can be set up, optimized and managed. Two exemplary real-world use cases of big data projects are described at the end of the first part. To help readers choose the optimal big data tools for given requirements, the second part of this chapter outlines the relevant technologies for handling big data, including NoSQL and NewSQL systems, in-memory databases, analytical platforms and Hadoop-based solutions. Finally, the chapter concludes with an overview of big data benchmarks that allow for performance optimization and evaluation of big data technologies. New big data applications in particular bring requirements that make the platforms more complex and more heterogeneous. The relevant benchmarks designed for big data technologies are categorized in the last part.

Keywords

Benchmark Suite, Graph Database, Hadoop Distributed File System, Customer Lifetime Value, Column Family
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Roberto V. Zicari (1)
  • Marten Rosselli (1, 2), corresponding author
  • Todor Ivanov (1)
  • Nikolaos Korfiatis (1, 3)
  • Karsten Tolle (1)
  • Raik Niemann (1)
  • Christoph Reichenbach (1)

  1. Frankfurt Big Data Lab, Goethe University Frankfurt, Frankfurt am Main, Germany
  2. Accenture, Frankfurt, Germany
  3. Norwich Business School, University of East Anglia, Norwich, UK
