SCSI: Real-Time Data Analysis with Cassandra and Spark

Chaudhari, Archana A.; Mulay, Preeti

doi:10.1007/978-981-13-0550-4_11


Part of the book series: Studies in Big Data (SBD, volume 43)


Highlights

An open-source framework for stream processing of very large data sets.
An in-memory processing model that runs machine learning algorithms.
Using a subset of the data in non-distributed mode performed better than using all of the data in distributed mode.
The Apache Spark platform handles big data sets with excellent parallel speedup.

Abstract

The rapid change in the nature of pervasive computing datasets has been the main motivation for the development of the NoSQL model. Devices that implement “Internet of Things” (IoT) concepts produce massive amounts of data in various forms, both structured and unstructured. Handling this IoT data with traditional database schemes is impractical and expensive. Large-scale unstructured data calls for a processing pipeline that seamlessly combines a NoSQL storage model such as Apache Cassandra with a Big Data processing platform such as Apache Spark. Apache Spark is a data-intensive computing platform that allows users to write applications in high-level programming languages including Java, Scala, R, and Python. The Spark Streaming module receives live input data streams and divides them into batches, which are then processed with map and reduce operations. This research presents a novel and scalable approach called “Smart Cassandra Spark Integration (SCSI)” that addresses the challenge of integrating NoSQL data stores such as Apache Cassandra with Apache Spark, in order to manage distributed systems built from a varied mix of current technologies and IT-enabled devices while eliminating complexity and risk. In this chapter, the SCSI Streaming framework is compared, for performance evaluation, with a file system-based data store, the Hadoop Streaming framework. The SCSI framework proved scalable, efficient, and accurate when computing large streams of IoT data.
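The batch-then-map-and-reduce idea mentioned in the abstract can be illustrated with a minimal sketch in plain Scala, using no Spark APIs. Everything here is an assumption made for illustration: the Reading type, the sensor names, and the batch size of three are hypothetical, and Spark Streaming itself cuts batches by time interval rather than by element count.

```scala
// Hypothetical sensor reading; the field names are illustrative only.
final case class Reading(sensor: String, watts: Double)

// A small in-memory sequence standing in for a live input stream.
val incoming = Vector(
  Reading("meter-a", 1.0), Reading("meter-b", 2.0), Reading("meter-a", 3.0),
  Reading("meter-b", 4.0), Reading("meter-a", 5.0), Reading("meter-b", 6.0)
)

// Step 1: cut the stream into batches of three readings each.
val batches = incoming.grouped(3).toVector

// Step 2: within one batch, map each reading to a (key, value) pair,
// then reduce the values that share a key.
def totalsPerSensor(batch: Seq[Reading]): Map[String, Double] =
  batch
    .map(r => (r.sensor, r.watts))
    .groupBy(_._1)
    .map { case (sensor, pairs) => sensor -> pairs.map(_._2).reduce(_ + _) }

// One result map per batch.
val batchTotals = batches.map(totalsPerSensor)
batchTotals.foreach(println)
```

Each batch is reduced independently, which is what allows a streaming engine to process batches as they arrive instead of waiting for the whole stream.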


References

Ray, P.: A survey of IoT cloud platforms. Future Comput. Inform. J. 1(1–2), 35–46 (2016)
UMassTraceRepository. http://traces.cs.umass.edu/index.php/Smart/Smart
National energy research scientific computing center. http://www.nersc.gov
Apache Spark. http://spark.apache.org
Chaudhari, A.A., Khanuja, H.K.: Extended SQL aggregation for database. Int. J. Comput. Trends Technol. (IJCTT) 18(6), 272–275 (2014)
Lakshman, A., Malik, P.: Cassandra: structured storage system on a P2P network. In: Proceedings of the 28th ACM Symposium on Principles of Distributed Computing, New York, NY, USA, pp. 1–5 (2009)
Cassandra wiki, operations. http://wiki.apache.org/cassandra/Operations
Dede, E., Sendir, B., Kuzlu, P., Hartog, J., Govindaraju, M.: An evaluation of Cassandra for Hadoop. In: Proceedings of the IEEE 6th International Conference on Cloud Computing, Washington, DC, USA, pp. 494–501 (2013)
Apache Hadoop. http://hadoop.apache.org
Premchaiswadi, W., Walisa, R., Sarayut, I., Nucharee, P.: Applying Hadoop’s MapReduce framework on clustering the GPS signals through cloud computing. In: International Conference on High Performance Computing and Simulation (HPCS), pp. 644–649 (2013)
Dede, E., Sendir, B., Kuzlu, P., Weachock, J., Govindaraju, M., Ramakrishnan, L.: Processing Cassandra datasets with Hadoop-Streaming based approaches. IEEE Trans. Serv. Comput. 9(1), 46–58 (2016)
Acharjya, D., Ahmed, K.P.: A survey on big data analytics: challenges, open research issues and tools. Int. J. Adv. Comput. Sci. Appl. 7, 511–518 (2016)
Karau, H.: Fast Data Processing with Spark. Packt Publishing Ltd. (2013)
Sakr, S.: Chapter 3: General-purpose big data processing systems. In: Big Data 2.0 Processing Systems. Springer, pp. 15–39 (2016)
Chen, J., Li, K., Tang, Z., Bilal, K.: A parallel random forest algorithm for big data in a Spark Cloud Computing environment. IEEE Trans. Parallel Distrib. Syst. 28(4), 919–933 (2017)
Sakr, S.: Big data 2.0 processing systems: a survey. Springer Briefs in Computer Science (2016)
Azarmi, B.: Chapter 4: The big (data) problem. In: Scalable Big Data Architecture, Springer, pp. 1–16 (2016)
Scala programming language. http://www.scala-lang.org
Landset, S., Khoshgoftaar, T.M., Richter, A.N., Hasanin, T.: A survey of open source tools for machine learning with big data in the Hadoop ecosystem. J. Big Data 2(1) (2015)
Wadkar, S., Siddalingaiah, M.: Apache Ambari. In: Pro Apache Hadoop, pp. 399–401. Springer (2014)
Kalantari, A., Kamsin, A., Kamaruddin, H., Ebrahim, N., Ebrahimi, A., Shamshirband, S.: A bibliometric approach to tracking big data research trends. J. Big Data, 1–18 (2017)

Web References

Belissent, J.: Chapter 5: Getting clever about smart cities: new opportunities require new business models. Forrester Research (2010)
Huang, W., Meng, L., Zhang, D., Zhang, W.: In-memory parallel processing of massive remotely sensed data using an Apache Spark on Hadoop YARN model. IEEE J. Sel. Topics Appl. Earth Obs. Remote Sens. 10(1), 3–19 (2017)
Soumaya, O., Mohamed, T., Soufiane, A., Abderrahmane, D., Mohamed, A.: Real-time data stream processing-challenges and perspectives. Int. J. Comput. Sci. Issues 14(5), 6–12 (2017)
Chaudhari, A.A., Khanuja, H.K.: Database transformation to build data-set for data mining analysis—a review. In: 2015 International Conference on Computing Communication Control and Automation (IEEE Digital library), pp. 386–389 (2015)
DataStax Enterprise. http://www.datastax.com/what-we-offer/products-services/datastax-enterprise
Blake, C.L., Merz, C.J.: UCI repository of machine learning database. Department of Information and Computer Science, University of California, Irvine, CA (1998). http://www.ics.uci.edu/~mlearn/MLRepository.html
Sundmaeker, H., Guillemin, P., Friess, P., Woelfflé, S.: Vision and challenges for realizing the Internet of Things. In: CERP-IoT-Cluster of European Research Projects on the Internet of Things (2010)

Additional References

Thusoo, A., Sarma, J., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H., Murthy, R.: Hive: a petabyte scale data warehouse using Hadoop. In: Proceedings of the IEEE 26th International Conference on Data Engineering, pp. 996–1005 (2010)
Yang, C., Yen, C., Tan, C., Madden, S.R.: Osprey: implementing MapReduce-style fault tolerance in a shared-nothing distributed database. In: Proceedings of the IEEE 26th International Conference on Data Engineering, pp. 657–668 (2010)
Kaldewey, T., Shekita, E.J., Tata, S.: Clydesdale: structured data processing on MapReduce. In: Proceedings of the 15th International Conference on Extending Database Technology, New York, NY, USA, pp. 15–25 (2012)


Author information

Authors and Affiliations

Symbiosis International (Deemed University), Pune, India
Archana A. Chaudhari
Department of CS, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
Preeti Mulay


Corresponding author

Correspondence to Preeti Mulay.

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, GB Pant Government Engineering College, New Delhi, India
Mamta Mittal
Department of Automation and Applied Informatics, Aurel Vlaicu University of Arad, Arad, Romania
Valentina E. Balas
Department of Computer Science and Engineering, Bharati Vidyapeeth’s College of Engineering, New Delhi, India
Lalit Mohan Goyal
Department of Computer Science and Engineering, Laxmi Narayan College of Technology, Jabalpur, Madhya Pradesh, India
Raghvendra Kumar

Annexure

How to Install Spark with Cassandra

The following steps describe how to set up a server with both a Spark node and a Cassandra node, with Spark and Cassandra both running on localhost. There are two ways to set up a Spark and Cassandra server. If you have DataStax Enterprise [3], you can simply install an Analytics Node and check the box for Spark. If you are using the open source version, you will need to follow the steps below.

These steps assume that Cassandra is already set up.

1. Download and set up Spark
  i. Go to http://spark.apache.org/downloads.html.
  ii. Choose Spark version 2.2.0 and “Pre-built for Hadoop 2.4”, then Direct Download. This will download an archive with the built binaries for Spark.
  iii. Extract the archive to a directory of your choosing, e.g. ~/apps/spark-1.2
  iv. Test that Spark is working by opening the shell, as described in the next step.
2. Test that Spark works
  i. cd into the Spark directory.
  ii. Run “./bin/spark-shell”. This will open the Spark interactive shell program.
  iii. If everything worked, it should display this prompt: “scala>”
  iv. Run a simple calculation, e.g. sc.parallelize(1 to 100).reduce(_ + _)
  v. Exit the Spark shell with the command “:quit” (older shells also accept “exit”).
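What the test calculation does can be sketched in plain Scala, without Spark. This is an illustration under stated assumptions rather than Spark's actual execution: the four chunks of 25 elements are a hypothetical stand-in for RDD partitions, and the real partition count depends on the cluster configuration.

```scala
// The same input as the spark-shell test: the integers 1 to 100.
val numbers = (1 to 100).toVector

// Split the data into four chunks, standing in for RDD partitions.
val partitions = numbers.grouped(25).toVector

// Each partition is reduced on its own (in Spark, on separate executors).
val partialSums = partitions.map(_.reduce(_ + _))

// The partial results are then combined into the final answer.
val total = partialSums.reduce(_ + _)
println(total) // prints 5050
```

Because addition is associative and commutative, the result does not depend on how the data is split, which is why reduce is safe to run in parallel.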

The Spark Cassandra Connector

To connect Spark to a Cassandra cluster, the Cassandra Connector needs to be added to the Spark project. DataStax provides its own Cassandra Connector on GitHub, and we will use that.

1. Clone the Spark Cassandra Connector repository: https://github.com/datastax/spark-cassandra-connector
2. cd into “spark-cassandra-connector”
3. Build the Spark Cassandra Connector
  i. Execute the command “./sbt/sbt assembly”
  ii. This should output compiled jar files to the directory named “target”. There will be two jar files, one for Scala and one for Java.
  iii. The jar we are interested in is the one for Scala: “spark-cassandra-connector-assembly-1.1.1-SNAPSHOT.jar”
4. Move the jar file into an easy-to-find directory: ~/apps/spark-1.2/jars

To Load the Connector into the Spark Shell

1. Start the shell with this command:
   ./bin/spark-shell --jars ~/apps/spark-1.2/jars/spark-cassandra-connector-assembly-1.1.1-SNAPSHOT.jar
2. Connect the Spark Context to the Cassandra cluster:
  i. Stop the default context: sc.stop
  ii. Import the necessary classes: import com.datastax.spark.connector._, import org.apache.spark.SparkContext, import org.apache.spark.SparkContext._, import org.apache.spark.SparkConf
  iii. Make a new SparkConf with the Cassandra connection details: val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
  iv. Create a new Spark Context: val sc = new SparkContext(conf)
3. You now have a new SparkContext which is connected to your Cassandra cluster.
4. From the Spark shell, run the following commands:
  i. val test_spark_rdd = sc.cassandraTable("test_spark", "test")
  ii. test_spark_rdd.first
  iii. The expected output, the first row of the table, is then displayed.
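The shell steps above can be collected into one session, sketched below. It is not self-contained: it assumes it is pasted into a spark-shell started with the connector jar, that Cassandra is listening on localhost, and that the keyspace test_spark with a table test (the names used in the steps) already exists. The name cassandraContext is an illustrative choice; a new name is used because, when the block is pasted as a single unit, redefining sc in the same block that calls sc.stop() would make that call refer to the not-yet-initialised new value.

```scala
// Run inside spark-shell, where `sc` is the context the shell created.
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Stop the default context so that a reconfigured one can replace it.
sc.stop()

// Load the default settings and add the Cassandra contact point.
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "localhost")

// New context that knows how to reach the Cassandra cluster.
val cassandraContext = new SparkContext(conf)

// Expose the Cassandra table as an RDD and fetch its first row.
val testSparkRdd = cassandraContext.cassandraTable("test_spark", "test")
println(testSparkRdd.first)
```

A common alternative that avoids replacing the context is to pass the setting at launch time, by adding --conf spark.cassandra.connection.host=localhost to the spark-shell command.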


About this chapter

Cite this chapter

Chaudhari, A.A., Mulay, P. (2019). SCSI: Real-Time Data Analysis with Cassandra and Spark. In: Mittal, M., Balas, V., Goyal, L., Kumar, R. (eds) Big Data Processing Using Spark in Cloud. Studies in Big Data, vol 43. Springer, Singapore. https://doi.org/10.1007/978-981-13-0550-4_11


DOI: https://doi.org/10.1007/978-981-13-0550-4_11
Published: 17 June 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-0549-8
Online ISBN: 978-981-13-0550-4
eBook Packages: Engineering, Engineering (R0)
