Highlights
- An open-source framework for stream processing and big data
- An in-memory processing model executed with machine learning algorithms
- Using a subset of the data in non-distributed mode is better than using all of the data in distributed mode
- The Apache Spark platform handles big data sets with excellent parallel speedup
Abstract
The rapid growth of pervasive computing datasets has been the main motivation for the development of the NoSQL model. Devices capable of executing "Internet of Things" (IoT) concepts are producing massive amounts of data in various forms (structured and unstructured). Handling this IoT data with traditional database schemes is impracticable and expensive. Large-scale unstructured data requires a processing pipeline that seamlessly combines a NoSQL storage model such as Apache Cassandra with a Big Data processing platform such as Apache Spark. Apache Spark is a data-intensive computing paradigm that allows users to write applications in several high-level programming languages, including Java, Scala, R, and Python. The Spark Streaming module receives live input data streams and divides that data into batches by using the Map and Reduce operations. This research presents a novel and scalable approach called "Smart Cassandra Spark Integration (SCSI)" for solving the challenge of integrating NoSQL data stores like Apache Cassandra with Apache Spark to manage distributed systems based on a varied mix of current technologies, IT-enabled devices, etc., while eliminating complexity and risk. In this chapter, for performance evaluation, the SCSI Streaming framework is compared with file-system-based data stores such as the Hadoop Streaming framework. The SCSI framework proved scalable, efficient, and accurate when computing big streams of IoT data.
References
Ray, P.: A survey of IoT cloud platforms. Future Comput. Inform. J. 1(1–2), 35–46 (2016)
UMassTraceRepository. http://traces.cs.umass.edu/index.php/Smart/Smart
National energy research scientific computing center. http://www.nersc.gov
Apache Spark. http://spark.apache.org
Chaudhari, A.A., Khanuja, H.K.: Extended SQL aggregation for database. Int. J. Comput. Trends Technol. (IJCTT) 18(6), 272–275 (2014)
Lakshman, A., Malik, P.: Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM Symposium on Principles of Distributed Computing, New York, NY, USA, pp. 1–5 (2009)
Cassandra wiki, operations. http://wiki.apache.org/cassandra/Operations
Dede, E., Sendir, B., Kuzlu, P., Hartog, J., Govindaraju, M.: An evaluation of Cassandra for Hadoop. In: Proceedings of the IEEE 6th International Conference on Cloud Computing, Washington, DC, USA, pp. 494–501 (2013)
Apache Hadoop. http://hadoop.apache.org
Premchaiswadi, W., Walisa, R., Sarayut, I., Nucharee, P.: Applying Hadoop’s MapReduce framework on clustering the GPS signals through cloud computing. In: International Conference on High Performance Computing and Simulation (HPCS), pp. 644–649 (2013)
Dede, E., Sendir, B., Kuzlu, P., Weachock, J., Govindaraju, M., Ramakrishnan, L.: Processing Cassandra datasets with Hadoop-streaming based approaches. IEEE Trans. Serv. Comput. 9(1), 46–58 (2016)
Acharjya, D., Ahmed, K.P.: A survey on big data analytics: challenges, open research issues and tools. Int. J. Adv. Comput. Sci. Appl. 7, 511–518 (2016)
Karau, H.: Fast Data Processing with Spark. Packt Publishing Ltd. (2013)
Sakr, S.: Chapter 3: General-purpose big data processing systems. In: Big Data 2.0 Processing Systems. Springer, pp. 15–39 (2016)
Chen, J., Li, K., Tang, Z., Bilal, K.: A parallel random forest algorithm for big data in a Spark Cloud Computing environment. IEEE Trans. Parallel Distrib. Syst. 28(4), 919–933 (2017)
Sakr, S.: Big data 2.0 processing systems: a survey. Springer Briefs in Computer Science (2016)
Azarmi, B.: Chapter 4: The big (data) problem. In: Scalable Big Data Architecture, Springer, pp. 1–16 (2016)
Scala programming language. http://www.scala-lang.org
Landset, S., Khoshgoftaar, T.M., Richter, A.N., Hasanin, T.: A survey of open source tools for machine learning with big data in the Hadoop ecosystem. J. Big Data 2(1) (2015)
Wadkar, S., Siddalingaiah, M.: Apache Ambari. In: Pro Apache Hadoop, pp. 399–401. Springer (2014)
Kalantari, A., Kamsin, A., Kamaruddin, H., Ebrahim, N., Ebrahimi, A., Shamshirband, S.: A bibliometric approach to tracking big data research trends. J. Big Data, 1–18 (2017)
Web References
Belissent, J.: Chapter 5: Getting clever about smart cities: new opportunities require new business models. Forrester Research (2010)
Huang, W., Meng, L., Zhang, D., Zhang, W.: In-memory parallel processing of massive remotely sensed data using an Apache Spark on Hadoop YARN model. IEEE J. Sel. Topics Appl. Earth Obs. Remote Sens. 10(1), 3–19 (2017)
Soumaya, O., Mohamed, T., Soufiane, A., Abderrahmane, D., Mohamed, A.: Real-time data stream processing-challenges and perspectives. Int. J. Comput. Sci. Issues 14(5), 6–12 (2017)
Chaudhari, A.A., Khanuja, H.K.: Database transformation to build data-set for data mining analysis—a review. In: 2015 International Conference on Computing Communication Control and Automation (IEEE Digital library), pp. 386–389 (2015)
DataStax Enterprise. http://www.datastax.com/what-we-offer/products-services/datastax-enterprise
Blake, C.L., Merz, C.J.: UCI repository of machine learning database. Department of Information and Computer Science, University of California, Irvine, CA (1998). http://www.ics.uci.edu/~mlearn/MLRepository.html
Sundmaeker, H., Guillemin, P., Friess, P., Woelfflé, S.: Vision and challenges for realizing the Internet of Things. In: CERP-IoT-Cluster of European Research Projects on the Internet of Things (2010)
Additional References
Thusoo, A., Sarma, J., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H., Murthy, R.: Hive - a petabyte scale data warehouse using Hadoop. In: Proceedings of the IEEE 26th International Conference on Data Engineering, pp. 996–1005 (2010)
Yang, C., Yen, C., Tan, C., Madden, S.R.: Osprey: implementing MapReduce-style fault tolerance in a shared-nothing distributed database. In: Proceedings of the IEEE 26th International Conference on Data Engineering, pp. 657–668 (2010)
Kaldewey, T., Shekita, E.J., Tata, S.: Clydesdale: structured data processing on MapReduce. In: Proceedings of the 15th International Conference on Extending Database Technology, New York, NY, USA, pp. 15–25 (2012)
Annexure
How to Install Spark with Cassandra
The following steps describe how to set up a server with both a Spark node and a Cassandra node (Spark and Cassandra will both be running on localhost). There are two ways to set up a Spark and Cassandra server: if you have DataStax Enterprise [3], you can simply install an Analytics Node and check the box for Spark; if you are using the open source version, you will need to follow these steps.
This assumes you already have Cassandra set up.
1. Download and set up Spark
   i. Go to the Apache Spark download page.
   ii. Choose Spark version 2.2.0 and “Pre-built for Hadoop 2.4”, then Direct Download. This downloads an archive containing the pre-built Spark binaries.
   iii. Extract the archive to a directory of your choosing, e.g. ~/apps/spark-1.2
   iv. Test that Spark is working by opening the shell.
2. Test that Spark works
   i. cd into the Spark directory.
   ii. Run “./bin/spark-shell”. This opens the Spark interactive shell.
   iii. If everything worked, it should display the prompt “scala>”.
   iv. Run a simple calculation, e.g. sc.parallelize(1 to 100).reduce(_ + _), which should return 5050.
   v. Exit the Spark shell with the command “:quit”.
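The check in step iv can be sketched as the following lines, pasted into the Spark shell. They assume the `sc` SparkContext that the shell creates at startup; the value names `numbers`, `total`, and `totalAsDouble` are illustrative and not part of the original steps.

```scala
// Run inside spark-shell; `sc` is the SparkContext the shell provides.
val numbers = sc.parallelize(1 to 100)   // distribute the range as an RDD[Int]

// reduce combines the elements pairwise with an associative function.
val total = numbers.reduce(_ + _)        // Int: 5050

// sum() takes no arguments and returns a Double.
val totalAsDouble = numbers.sum()        // Double: 5050.0

println(s"reduce: $total, sum: $totalAsDouble")
```

Both calls trigger a job on the local Spark node, so a correct result confirms that the shell can schedule and run tasks.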
The Spark Cassandra Connector
To connect Spark to a Cassandra cluster, the Cassandra Connector will need to be added to the Spark project. DataStax provides their own Cassandra Connector on GitHub and we will use that.
1. Clone the Spark Cassandra Connector repository: https://github.com/datastax/spark-cassandra-connector
2. cd into “spark-cassandra-connector”.
3. Build the Spark Cassandra Connector
   i. Execute the command “./sbt/sbt assembly”.
   ii. This should output compiled jar files to the directory named “target”. There will be two jar files, one for Scala and one for Java.
   iii. The jar we are interested in is the one for Scala: “spark-cassandra-connector-assembly-1.1.1-SNAPSHOT.jar”.
4. Move the jar file into an easy-to-find directory: ~/apps/spark-1.2/jars
To Load the Connector into the Spark Shell
1. From the Spark directory, start the shell with this command:
   ./bin/spark-shell --jars ~/apps/spark-1.2/jars/spark-cassandra-connector-assembly-1.1.1-SNAPSHOT.jar
2. Connect the Spark Context to the Cassandra cluster:
   i. Stop the default context: sc.stop
   ii. Import the necessary classes: import com.datastax.spark.connector._, import org.apache.spark.SparkContext, import org.apache.spark.SparkContext._, import org.apache.spark.SparkConf
   iii. Make a new SparkConf with the Cassandra connection details: val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
   iv. Create a new Spark Context: val sc = new SparkContext(conf)
3. You now have a new SparkContext that is connected to your Cassandra cluster.
4. From the Spark shell, run the following commands:
   i. val test_spark_rdd = sc.cassandraTable("test_spark", "test")
   ii. test_spark_rdd.first
   iii. The expected output is then displayed.
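Steps 2 to 4 above can be gathered into a single snippet for the Spark shell (for example, pasted with `:paste`). This is a sketch under the assumptions of the walkthrough: the shell was started with the connector jar, a Cassandra node is running on localhost, and a keyspace `test_spark` with a table `test` already exists. The names `cassandraSc` and `testSparkRdd`, and the extra `count` and `take` calls, are illustrative additions rather than part of the original steps.

```scala
// Paste into spark-shell started with --jars pointing at the connector assembly jar.
import com.datastax.spark.connector._          // adds cassandraTable to SparkContext
import org.apache.spark.{SparkConf, SparkContext}

// Stop the context the shell created so a new one can be configured.
sc.stop()

// loadDefaults = true keeps the spark.* settings the shell was launched with.
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "localhost")

// A new name is used here so that sc.stop() above still refers to the
// shell's original context when the whole snippet is pasted at once.
val cassandraSc = new SparkContext(conf)

// Assumed keyspace "test_spark" and table "test", as in step 4.
val testSparkRdd = cassandraSc.cassandraTable("test_spark", "test")

println(testSparkRdd.first)            // first row of the table
println(testSparkRdd.count)            // number of rows
testSparkRdd.take(5).foreach(println)  // up to five rows
```

If the table is empty, `first` throws an exception instead of returning a row, so `take(5)` is the safer call for a quick look at a table whose contents are unknown.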
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Chaudhari, A.A., Mulay, P. (2019). SCSI: Real-Time Data Analysis with Cassandra and Spark. In: Mittal, M., Balas, V., Goyal, L., Kumar, R. (eds) Big Data Processing Using Spark in Cloud. Studies in Big Data, vol 43 . Springer, Singapore. https://doi.org/10.1007/978-981-13-0550-4_11
DOI: https://doi.org/10.1007/978-981-13-0550-4_11
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-0549-8
Online ISBN: 978-981-13-0550-4
eBook Packages: Engineering, Engineering (R0)