Abstract
High Performance Computing (HPC) has traditionally been characterized by low latency, high throughput, massive parallelism, and massively distributed systems. Big Data or analytics platforms share some of these characteristics but today offer weaker guarantees on latency and throughput. Big Data platforms have been applied to problems where the data being operated upon is in motion, while HPC has traditionally been applied to scientific computations where the data is at rest. The programming paradigms in use in Big Data platforms, for example MapReduce (Google Research Publication: MapReduce. Retrieved November 29, 2016, from http://research.google.com/archive/mapreduce.html) and Spark Streaming (Spark Streaming/Apache Spark. Retrieved November 29, 2016, from https://spark.apache.org/streaming/), have their genesis in HPC, but they need to address some of the distinct characteristics of Big Data platforms. Bringing high performance to Big Data platforms therefore means addressing the following:
1. Ingesting data at high volume with low latency
2. Processing streaming data at high volume with low latency
3. Storing data in a distributed data store
4. Indexing and searching the stored data for real-time processing
To achieve items 1 through 4 above, the right hardware and software components need to be chosen. With the plethora of software stacks and the different kinds of hardware infrastructure, including public/private cloud, on-premise, and co-located hardware, there are many criteria, characteristics, and metrics to evaluate in order to make the right choices. We show that it is essential to have the right tools to make this evaluation as accurate as possible, and then the appropriate software to maintain the performance of such systems as they scale. We then identify the different types of hardware infrastructure in the cloud, including Amazon Web Services (AWS) (Amazon Web Services. What is AWS? Retrieved November 29, 2016, from https://aws.amazon.com/what-is-aws), and the different types of on-premise hardware infrastructure, including converged hyperscale infrastructure from vendors such as Nutanix (Nutanix-The Enterprise Cloud Company. Retrieved November 29, 2016, from http://www.nutanix.com/) and traditional vendors such as Dell and HP. We also explore high-performance offerings from emerging open network switch device makers such as Cumulus (Better, Faster, Easier Networks. Retrieved November 29, 2016, from https://cumulusnetworks.com/) and from traditional vendors such as Cisco (Cisco. Retrieved November 29, 2016, from http://www.cisco.com/), as well as various storage architectures and their relative merits in the context of Big Data.
References
A Guide to Software-Defined Storage. (n.d.). Retrieved November 29, 2016, from http://www.computerweekly.com/guides/A-guide-to-software-defined-storage
Aerospike High Performance NoSQL Database. (n.d.). Retrieved November 29, 2016, from http://www.aerospike.com/
Amazon Elastic Block Store (EBS)–Block storage for EC2. (n.d.). Retrieved November 29, 2016, from https://aws.amazon.com/ebs/
Amazon Elastic Compute Cloud. (n.d.). Retrieved May 5, 2017, from https://aws.amazon.com/ec2/
Amazon EMR. (n.d.). Retrieved May 5, 2017, from https://aws.amazon.com/emr/
Amazon Web Services. (n.d.). What is AWS? Retrieved November 29, 2016, from https://aws.amazon.com/what-is-aws
Amazon Web Services (AWS). (n.d.). Cloud Computing Services. Retrieved November 29, 2016, from https://aws.amazon.com/
Apache Hive. (n.d.). Retrieved November 29, 2016, from https://hive.apache.org/
Apache Kafka. (n.d.). Retrieved November 29, 2016, from http://kafka.apache.org/
Apache Lucene™. (n.d.). Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™. Retrieved November 29, 2016, from http://lucene.apache.org/solr
Apache Spark™–Lightning-Fast Cluster Computing. (n.d.). Retrieved November 29, 2016, from http://spark.apache.org/
Apache Storm. (n.d.). Retrieved November 29, 2016, from http://storm.apache.org/
Better, Faster, Easier Networks. (n.d.). Retrieved November 29, 2016, from https://cumulusnetworks.com/
Cassandra. (n.d.). Manage massive amounts of data, fast, without losing sleep. Retrieved November 29, 2016, from http://cassandra.apache.org/
Cisco. (n.d.). Retrieved November 29, 2016, from http://www.cisco.com/
Concord Documentation. (n.d.). Retrieved November 29, 2016, from http://concord.io/docs/
Data Warehouse. (n.d.). Retrieved November 29, 2016, from https://en.wikipedia.org/wiki/Data_warehouse
Databricks Spark-Perf. (n.d.). Retrieved November 29, 2016, from https://github.com/databricks/spark-perf
DataStax. (n.d.-a). Case Study: Netflix. Retrieved November 29, 2016, from http://www.datastax.com/resources/casestudies/netflix
DataStax. (n.d.-b). Retrieved November 29, 2016, from http://www.datastax.com/
EC2 Instance Types–Amazon Web Services (AWS). (n.d.). Retrieved November 29, 2016, from https://aws.amazon.com/ec2/instance-types/
Elastic. (n.d.). An introduction to the ELK Stack (Now the Elastic Stack). Retrieved November 29, 2016, from https://www.elastic.co/webinars/introduction-elk-stack
Gartner. (n.d.). Gartner says the internet of things will transform the data center. Retrieved November 29, 2016, from http://www.gartner.com/newsroom/id/2684915
Google Research Publication. (n.d.). MapReduce. Retrieved November 29, 2016, from http://research.google.com/archive/mapreduce.html
Hive. (n.d.). A Petabyte Scale Data Warehouse using Hadoop–Facebook. Retrieved November 29, 2016, from https://www.facebook.com/notes/facebook-engineering/hive-a-petabyte-scale-data-warehouse-using-hadoop/89508453919/
Hyperdrive Innovation. (n.d.). Retrieved November 29, 2016, from http://hypergrid.com/
Jacobi, J. L. (2015). Everything you need to know about NVMe, the insanely fast future for SSDs. Retrieved November 29, 2016, from http://www.pcworld.com/article/2899351/everything-you-need-to-know-about-nvme.html
Kafka Ecosystem at LinkedIn. (n.d.). Retrieved November 29, 2016, from https://engineering.linkedin.com/blog/2016/04/kafka-ecosystem-at-linkedin
Keen IO. (n.d.). Retrieved May 5, 2017, from https://keen.io/
Linden, G., Smith, B., & York, J. (2003). Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7(1), 76–80. doi:10.1109/mic.2003.1167344.
MapReduce Tutorial. (n.d.). Retrieved November 29, 2016, from https://hadoop.apache.org/docs/r2.7.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
MemSQL. (n.d.). How pinterest measures real-time user engagement with spark. Retrieved November 29, 2016, from http://blog.memsql.com/pinterest-apache-spark-use-case/
Microsoft Azure. (n.d.-a). HDInsight-Hadoop, Spark, and R Solutions for the Cloud/Microsoft Azure. Retrieved November 29, 2016, from https://azure.microsoft.com/en-us/services/hdinsight
Microsoft Azure. (n.d.-b). Cloud computing platform and services. Retrieved November 29, 2016, from https://azure.microsoft.com/
MityLytics. (n.d.). High performance analytics at scale. Retrieved November 29, 2016, from https://mitylytics.com/
Netflix. (n.d.-a). Kafka inside keystone pipeline. Retrieved November 29, 2016, from http://techblog.netflix.com/2016/04/kafka-inside-keystone-pipeline.html
Netflix. (n.d.-b). Netflix Billing Migration to AWS–Part II. Retrieved November 29, 2016, from http://techblog.netflix.com/2016/07/netflix-billing-migration-to-aws-part-ii.html
Nutanix–The Enterprise Cloud Company. (n.d.). Retrieved November 29, 2016, from http://www.nutanix.com/
O’Malley, O. (2008, May). TeraByte Sort on Apache Hadoop. Retrieved November 29, 2016, from http://sortbenchmark.org/YahooHadoop.pdf
Overview/Apache Phoenix. (n.d.). Retrieved November 29, 2016, from http://phoenix.apache.org/
Performance without Compromise/Internap. (n.d.). Retrieved November 29, 2016, from http://www.internap.com/
Platform as a Service. (n.d.). Retrieved November 29, 2016, from https://en.wikipedia.org/wiki/Platform_as_a_service
Premium Bare Metal Servers and Container Hosting–Packet. (n.d.). Retrieved November 29, 2016, from http://www.packet.net/
Real-Time Data Warehouse. (n.d.). Retrieved November 29, 2016, from http://www.memsql.com/
ScaleIO|Software-Defined Block Storage/EMC. (n.d.). Retrieved November 29, 2016, from http://www.emc.com/storage/scaleio/index.htm
SoftLayer|cloud Servers, Storage, Big Data, and more IAAS Solutions. (n.d.). Retrieved November 29, 2016, from http://www.softlayer.com/
Software-Defined Compute–Factsheet–IDC_P10666. (2005). Retrieved August 31, 2016, from https://www.idc.com/getdoc.jsp?containerId=IDC_P10666
Software-Defined Networking (SDN) Definition. (n.d.). Retrieved November 29, 2016, from https://www.opennetworking.org/sdn-resources/sdn-definition
Spark Streaming/Apache Spark. (n.d.). Retrieved November 29, 2016, from https://spark.apache.org/streaming/
TPC-DS–Homepage. (n.d.). Retrieved November 29, 2016, from http://www.tpc.org/tpcds/default.asp
VansonBourne. (2015). The state of big data infrastructure: benchmarking global big data users to drive future performance. Retrieved August 23, 2016, from http://www.ca.com/content/dam/ca/us/files/industry-analyst-report/the-state-of-big-datainfrastructure.pdf
Virtual Storage: Software defined storage array and hyper-converged solutions. (n.d.). Retrieved November 29, 2016, from https://www.hpe.com/us/en/storage/storevirtual.html
Welcome to Apache Flume. (n.d.). Retrieved November 29, 2016, from https://flume.apache.org/
Welcome to Apache Pig! (n.d.). Retrieved November 29, 2016, from https://pig.apache.org/
Wilson, R. (2015). Big data needs a new type of non-volatile memory. Retrieved November 29, 2016, from http://www.electronicsweekly.com/news/big-data-needs-a-new-type-of-non-volatile-memory-2015-10/
World fastest NoSQL Database. (n.d.). Retrieved November 29, 2016, from http://www.scylladb.com/
Xia, F., Lang, L. T., Wang, L., & Vinel, A. (2012). Internet of things. International Journal of Communication Systems, 25, 1101–1102. doi:10.1002/dac.2417.
© 2018 Springer International Publishing AG
About this chapter
Cite this chapter
Divate, R., Sah, S., Singh, M. (2018). High Performance Computing and Big Data. In: Srinivasan, S. (ed.) Guide to Big Data Applications. Studies in Big Data, vol 26. Springer, Cham. https://doi.org/10.1007/978-3-319-53817-4_6
DOI: https://doi.org/10.1007/978-3-319-53817-4_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-53816-7
Online ISBN: 978-3-319-53817-4
eBook Packages: Engineering, Engineering (R0)