High Performance Computing and Big Data

Divate, Rishi; Sah, Sankalp; Singh, Manish

doi:10.1007/978-3-319-53817-4_6

Part of the book series: Studies in Big Data ((SBD,volume 26))

Abstract

High Performance Computing (HPC) has traditionally been characterized by low latency, high throughput, massive parallelism and massively distributed systems. Big Data or analytics platforms share some of these characteristics but, as of today, offer more limited guarantees on latency and throughput. Big Data platforms have been applied to problems where the data being operated upon is in motion, while HPC has traditionally been applied to scientific computations where the data is at rest. The programming paradigms in use in Big Data platforms, for example Map-Reduce (Google Research Publication: MapReduce. Retrieved November 29, 2016, from http://research.google.com/archive/mapreduce.html) and Spark Streaming (Spark Streaming/Apache Spark. Retrieved November 29, 2016, from https://spark.apache.org/streaming/), have their genesis in HPC, but they need to address some of the distinct characteristics of Big Data platforms. Bringing high performance to Big Data platforms therefore means addressing the following:

1. Ingesting data at high volume with low latency
2. Processing streaming data at high volume with low latency
3. Storing data in a distributed data store
4. Indexing and searching the stored data for real-time processing

To achieve the four goals listed above, the right hardware and software components need to be chosen. With the plethora of software stacks and different kinds of hardware infrastructure, including public/private cloud, on-premise and co-located hardware, there are many criteria, characteristics and metrics to be evaluated in order to make the right choices. We show that it is of the utmost importance to have the right tools to make this kind of evaluation as accurate as possible, and then to have the appropriate software to maintain the performance of such systems as they scale. We then identify the different types of hardware infrastructure in the cloud, including Amazon Web Services (AWS) (Amazon Web Services. What is AWS? Retrieved November 29, 2016, from https://aws.amazon.com/what-is-aws), and different types of on-premise hardware infrastructure, including converged hyperscale infrastructure from vendors such as Nutanix (Nutanix-The Enterprise Cloud Company. Retrieved November 29, 2016, from http://www.nutanix.com/) and traditional vendors such as Dell and HP. We also explore high-performance offerings from emerging open network switch device makers such as Cumulus (Better, Faster, Easier Networks. Retrieved November 29, 2016, from https://cumulusnetworks.com/) and from traditional vendors such as Cisco (Cisco. Retrieved November 29, 2016, from http://www.cisco.com/), as well as various storage architectures and their relative merits in the context of Big Data.
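
As a rough illustration of how the four requirements fit together, the following is a minimal single-process sketch in Python. It is not taken from the chapter and does not use any of the platforms cited here; every name in it (ingest, process, partition_of, run_pipeline, lookup, NUM_PARTITIONS) is hypothetical. It groups records into micro-batches, applies a map-and-reduce style word count to each batch, writes the counts to a partitioned in-memory store, and maintains a simple inverted index for lookups.

```python
from collections import Counter, defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count for the toy store


def ingest(records, batch_size):
    """Step 1: group incoming records into micro-batches."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def process(batch):
    """Step 2: map each record to words, then reduce to per-word counts."""
    counts = Counter()
    for record in batch:
        counts.update(record.lower().split())
    return counts


def partition_of(key):
    """Deterministic partitioner (sum of bytes), standing in for a real hash."""
    return sum(key.encode("utf-8")) % NUM_PARTITIONS


def run_pipeline(records, batch_size=2):
    store = [dict() for _ in range(NUM_PARTITIONS)]  # step 3: partitioned store
    index = defaultdict(set)                          # step 4: word -> batch ids
    for batch_id, batch in enumerate(ingest(records, batch_size)):
        for word, count in process(batch).items():
            shard = store[partition_of(word)]
            shard[word] = shard.get(word, 0) + count
            index[word].add(batch_id)
    return store, index


def lookup(store, word):
    """Read a count back from the partition that owns the word."""
    return store[partition_of(word)].get(word, 0)


records = ["spark kafka", "kafka storm", "spark spark", "hive"]
store, index = run_pipeline(records)
print(lookup(store, "spark"), sorted(index["kafka"]))  # prints: 3 [0]
```

In a real deployment each stage would run on separate, distributed components, and the latency and throughput of each stage would have to be measured on the target hardware; the sketch only shows the data flow between the stages.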

References

A Guide to Software-Defined Storage. (n.d.). Retrieved November 29, 2016, from http://www.computerweekly.com/guides/A-guide-to-software-defined-storage
Aerospike High Performance NoSQL Database. (n.d.). Retrieved November 29, 2016, from http://www.aerospike.com/
Amazon Elastic Block Store (EBS)—Block storage for EC2. (n.d.). Retrieved November 29, 2016, from https://aws.amazon.com/ebs/
Amazon Elastic Compute Cloud. (n.d.). Retrieved May 5, 2017, from https://aws.amazon.com/ec2/
Amazon EMR. (n.d.). Retrieved May 5, 2017, from https://aws.amazon.com/emr/
Amazon Web Services. (n.d.). What is AWS? Retrieved November 29, 2016, from https://aws.amazon.com/what-is-aws
Amazon Web Services (AWS). (n.d.). Cloud Computing Services. Retrieved November 29, 2016, from https://aws.amazon.com/
Apache Hive. (n.d.). Retrieved November 29, 2016, from https://hive.apache.org/
Apache Kafka. (n.d.). Retrieved November 29, 2016, from http://kafka.apache.org/
Apache Lucene™. (n.d.). Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™. Retrieved November 29, 2016, from http://lucene.apache.org/solr
Apache Spark™: Lightning-Fast Cluster Computing. (n.d.). Retrieved November 29, 2016, from http://spark.apache.org/
Apache Storm. (n.d.). Retrieved November 29, 2016, from http://storm.apache.org/
Better, Faster, Easier Networks. (n.d.). Retrieved November 29, 2016, from https://cumulusnetworks.com/
Cassandra. (n.d.). Manage massive amounts of data, fast, without losing sleep. Retrieved November 29, 2016, from http://cassandra.apache.org/
Cisco. (n.d.). Retrieved November 29, 2016, from http://www.cisco.com/
Concord Documentation. (n.d.). Retrieved November 29, 2016, from http://concord.io/docs/
Data Warehouse. (n.d.). Retrieved November 29, 2016, from https://en.wikipedia.org/wiki/Data_warehouse
Databricks Spark-Perf. (n.d.). Retrieved November 29, 2016, from https://github.com/databricks/spark-perf
DataStax. (n.d.-a). Case Study: Netflix. Retrieved November 29, 2016, from http://www.datastax.com/resources/casestudies/netflix
DataStax. (n.d.-b). Retrieved November 29, 2016, from http://www.datastax.com/
EC2 Instance Types—Amazon Web Services (AWS). (n.d.). Retrieved November 29, 2016, from https://aws.amazon.com/ec2/instance-types/
Elastic. (n.d.). An introduction to the ELK Stack (Now the Elastic Stack). Retrieved November 29, 2016, from https://www.elastic.co/webinars/introduction-elk-stack
Gartner. (n.d.). Gartner says the internet of things will transform the data center. Retrieved November 29, 2016, from http://www.gartner.com/newsroom/id/2684915
Google Research Publication. (n.d.). MapReduce. Retrieved November 29, 2016, from http://research.google.com/archive/mapreduce.html
Hive. (n.d.). A Petabyte Scale Data Warehouse using Hadoop–Facebook. Retrieved November 29, 2016, from https://www.facebook.com/notes/facebook-engineering/hive-a-petabyte-scale-data-warehouse-using-hadoop/89508453919/
Hyperdrive Innovation. (n.d.). Retrieved November 29, 2016, from http://hypergrid.com/
Jacobi, J. L. (2015). Everything you need to know about NVMe, the insanely fast future for SSDs. Retrieved November 29, 2016, from http://www.pcworld.com/article/2899351/everything-you-need-to-know-about-nvme.html
Kafka Ecosystem at LinkedIn. (n.d.). Retrieved November 29, 2016, from https://engineering.linkedin.com/blog/2016/04/kafka-ecosystem-at-linkedin
Keen IO. (n.d.). Retrieved May 5, 2017, from https://keen.io/
Linden, G., Smith, B., & York, J. (2003). Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7(1), 76–80. doi:10.1109/mic.2003.1167344.
MapReduce Tutorial. (n.d.). Retrieved November 29, 2016, from https://hadoop.apache.org/docs/r2.7.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
MemSQL. (n.d.). How pinterest measures real-time user engagement with spark. Retrieved November 29, 2016, from http://blog.memsql.com/pinterest-apache-spark-use-case/
Microsoft Azure. (n.d.-a). HDInsight-Hadoop, Spark, and R Solutions for the Cloud/Microsoft Azure. Retrieved November 29, 2016, from https://azure.microsoft.com/en-us/services/hdinsight
Microsoft Azure. (n.d.-b). Cloud computing platform and services. Retrieved November 29, 2016, from https://azure.microsoft.com/
MityLytics. (n.d.). High performance analytics at scale. Retrieved November 29, 2016, from https://mitylytics.com/
Netflix. (n.d.-a). Kafka inside keystone pipeline. Retrieved November 29, 2016, from http://techblog.netflix.com/2016/04/kafka-inside-keystone-pipeline.html
Netflix. (n.d.-b). Netflix Billing Migration to AWS–Part II. Retrieved November 29, 2016, from http://techblog.netflix.com/2016/07/netflix-billing-migration-to-aws-part-ii.html
Nutanix–The Enterprise Cloud Company. (n.d.). Retrieved November 29, 2016, from http://www.nutanix.com/
O’Malley, O. (2008, May). TeraByte Sort on Apache Hadoop. Retrieved November 29, 2016, from http://sortbenchmark.org/YahooHadoop.pdf
Overview/Apache Phoenix. (n.d.). Retrieved November 29, 2016, from http://phoenix.apache.org/
Performance without Compromise/Internap. (n.d.). Retrieved November 29, 2016, from http://www.internap.com/
Platform as a Service. (n.d.). Retrieved November 29, 2016, from https://en.wikipedia.org/wiki/Platform_as_a_service
Premium Bare Metal Servers and Container Hosting–Packet. (n.d.). Retrieved November 29, 2016, from http://www.packet.net/
Real-Time Data Warehouse. (n.d.). Retrieved November 29, 2016, from http://www.memsql.com/
ScaleIO|Software-Defined Block Storage/EMC. (n.d.). Retrieved November 29, 2016, from http://www.emc.com/storage/scaleio/index.htm
SoftLayer|cloud Servers, Storage, Big Data, and more IAAS Solutions. (n.d.). Retrieved November 29, 2016, from http://www.softlayer.com/
Software-Defined Compute–Factsheet–IDC_P10666. (2005). Retrieved August 31, 2016, from https://www.idc.com/getdoc.jsp?containerId=IDC_P10666
Software-Defined Networking (SDN) Definition. (n.d.). Retrieved November 29, 2016, from https://www.opennetworking.org/sdn-resources/sdn-definition
Spark Streaming/Apache Spark. (n.d.). Retrieved November 29, 2016, from https://spark.apache.org/streaming/
TPC-DS–Homepage. (n.d.). Retrieved November 29, 2016, from http://www.tpc.org/tpcds/default.asp
VansonBourne. (2015). The state of big data infrastructure: benchmarking global big data users to drive future performance. Retrieved August 23, 2016, from http://www.ca.com/content/dam/ca/us/files/industry-analyst-report/the-state-of-big-datainfrastructure.pdf
Virtual Storage: Software defined storage array and hyper-converged solutions. (n.d.). Retrieved November 29, 2016, from https://www.hpe.com/us/en/storage/storevirtual.html
Welcome to Apache Flume. (n.d.). Retrieved November 29, 2016, from https://flume.apache.org/
Welcome to Apache Pig! (n.d.). Retrieved November 29, 2016, from https://pig.apache.org/
Wilson, R. (2015). Big data needs a new type of non-volatile memory. Retrieved November 29, 2016, from http://www.electronicsweekly.com/news/big-data-needs-a-new-type-of-non-volatile-memory-2015-10/
World fastest NoSQL Database. (n.d.). Retrieved November 29, 2016, from http://www.scylladb.com/
Xia, F., Lang, L. T., Wang, L., & Vinel, A. (2012). Internet of things. International Journal of Communication Systems, 25, 1101–1102. doi:10.1002/dac.2417.

Author information

Authors and Affiliations

MityLytics Inc., Alameda, CA, 94502, USA
Rishi Divate, Sankalp Sah & Manish Singh

Corresponding author

Correspondence to Manish Singh.

Editor information

Editors and Affiliations

Jesse H. Jones School of Business, Texas Southern University, Houston, Texas, USA
S. Srinivasan

About this chapter

Cite this chapter

Divate, R., Sah, S., Singh, M. (2018). High Performance Computing and Big Data. In: Srinivasan, S. (eds) Guide to Big Data Applications. Studies in Big Data, vol 26. Springer, Cham. https://doi.org/10.1007/978-3-319-53817-4_6

DOI: https://doi.org/10.1007/978-3-319-53817-4_6
Published: 27 May 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-53816-7
Online ISBN: 978-3-319-53817-4
eBook Packages: Engineering, Engineering (R0)