SCSI: Real-Time Data Analysis with Cassandra and Spark

  • Chapter
Big Data Processing Using Spark in Cloud

Part of the book series: Studies in Big Data (SBD, volume 43)

Highlights

  • An open-source framework for stream processing and big data

  • An in-memory processing model is used to run the machine learning algorithms

  • Using a subset of the data in non-distributed mode is better than using all of the data in distributed mode

  • The Apache Spark platform handles big data sets with near-ideal parallel speedup

Abstract The rapid evolution of pervasive computing datasets has been the main motivation for developing the NoSQL model. Devices capable of implementing “Internet of Things” (IoT) concepts produce massive amounts of data in various forms (structured and unstructured). Handling this IoT data with traditional database schemes is impractical and expensive. Large-scale unstructured data requires a processing pipeline that seamlessly combines a NoSQL storage model such as Apache Cassandra with a Big Data processing platform such as Apache Spark. Apache Spark is a data-intensive computing paradigm that allows users to write applications in various high-level programming languages, including Java, Scala, R, and Python. The Spark Streaming module receives live input data streams and divides the data into batches, which are then processed with operations such as map and reduce. This research presents a novel and scalable approach called “Smart Cassandra Spark Integration (SCSI)” for solving the challenge of integrating NoSQL data stores like Apache Cassandra with Apache Spark to manage distributed systems built on a varied mix of current technologies, IT-enabled devices, etc., while eliminating complexity and risk. In this chapter, for performance evaluation, the SCSI Streaming framework is compared with file system-based data stores such as the Hadoop Streaming framework. The SCSI framework proved scalable, efficient, and accurate when computing big streams of IoT data.

References

  1. Ray, P.: A survey of IoT cloud platforms. Future Comput. Inform. J. 1(1–2), 35–46 (2016)

  2. UMassTraceRepository. http://traces.cs.umass.edu/index.php/Smart/Smart

  3. National energy research scientific computing center. http://www.nersc.gov

  4. Apache Spark. http://spark.apache.org

  5. Chaudhari, A.A., Khanuja, H.K.: Extended SQL aggregation for database. Int. J. Comput. Trends Technol. (IJCTT) 18(6), 272–275 (2014)

  6. Lakshman, A., Malik, P.: Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM Symposium on Principles of Distributed Computing, New York, NY, USA, pp. 1–5 (2009)

  7. Cassandra wiki, operations. http://wiki.apache.org/cassandra/Operations

  8. Dede, E., Sendir, B., Kuzlu, P., Hartog, J., Govindaraju, M.: An evaluation of Cassandra for Hadoop. In: Proceedings of the IEEE 6th International Conference on Cloud Computing, Washington, DC, USA, pp. 494–501 (2013)

  9. Apache Hadoop. http://hadoop.apache.org

  10. Premchaiswadi, W., Walisa, R., Sarayut, I., Nucharee, P.: Applying Hadoop’s MapReduce framework on clustering the GPS signals through cloud computing. In: International Conference on High Performance Computing and Simulation (HPCS), pp. 644–649 (2013)

  11. Dede, E., Sendir, B., Kuzlu, P., Weachock, J., Govindaraju, M., Ramakrishnan, L.: Processing Cassandra datasets with Hadoop-Streaming based approaches. IEEE Trans. Serv. Comput. 9(1), 46–58 (2016)

  12. Acharjya, D., Ahmed, K.P.: A survey on big data analytics: challenges, open research issues and tools. Int. J. Adv. Comput. Sci. Appl. 7, 511–518 (2016)

  13. Karau, H.: Fast Data Processing with Spark. Packt Publishing Ltd. (2013)

  14. Sakr, S.: Chapter 3: General-purpose big data processing systems. In: Big Data 2.0 Processing Systems. Springer, pp. 15–39 (2016)

  15. Chen, J., Li, K., Tang, Z., Bilal, K.: A parallel random forest algorithm for big data in a Spark Cloud Computing environment. IEEE Trans. Parallel Distrib. Syst. 28(4), 919–933 (2017)

  16. Sakr, S.: Big data 2.0 processing systems: a survey. Springer Briefs in Computer Science (2016)

  17. Azarmi, B.: Chapter 4: The big (data) problem. In: Scalable Big Data Architecture, Springer, pp. 1–16 (2016)

  18. Scala programming language. http://www.scala-lang.org

  19. Landset, S., Khoshgoftaar, T.M., Richter, A.N., Hasanin, T.: A survey of open source tools for machine learning with big data in the Hadoop ecosystem. J. Big Data 2.1 (2015)

  20. Wadkar, S., Siddalingaiah, M.: Apache Ambari. In: Pro Apache Hadoop, pp. 399–401. Springer (2014)

  21. Kalantari, A., Kamsin, A., Kamaruddin, H., Ebrahim, N., Ebrahimi, A., Shamshirband, S.: A bibliometric approach to tracking big data research trends. J. Big Data, 1–18 (2017)

Web References

  1. Belissent, J.: Chapter 5: Getting clever about smart cities: new opportunities require new business models. Forrester Research (2010)

  2. Huang, W., Meng, L., Zhang, D., Zhang, W.: In-memory parallel processing of massive remotely sensed data using an Apache Spark on Hadoop YARN model. IEEE J. Sel. Topics Appl. Earth Obs. Remote Sens. 10(1), 3–19 (2017)

  3. Soumaya, O., Mohamed, T., Soufiane, A., Abderrahmane, D., Mohamed, A.: Real-time data stream processing-challenges and perspectives. Int. J. Comput. Sci. Issues 14(5), 6–12 (2017)

  4. Chaudhari, A.A., Khanuja, H.K.: Database transformation to build data-set for data mining analysis—a review. In: 2015 International Conference on Computing Communication Control and Automation (IEEE Digital library), pp. 386–389 (2015)

  5. DataStax Enterprise. http://www.datastax.com/what-we-offer/products-services/datastax-enterprise

  6. Blake, C.L., Merz, C.J.: UCI repository of machine learning database. Department of Information and Computer Science, University of California, Irvine, CA (1998). http://www.ics.uci.edu/~mlearn/MLRepository.html

  7. Sundmaeker, H., Guillemin, P., Friess, P., Woelfflé, S.: Vision and challenges for realizing the Internet of Things. In: CERP-IoT-Cluster of European Research Projects on the Internet of Things (2010)

Additional References

  1. Thusoo, A., Sarma, J., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H., Murthy, R.: Hive: a petabyte scale data warehouse using Hadoop. In: Proceedings of the IEEE 26th International Conference on Data Engineering, pp. 996–1005 (2010)

  2. Yang, C., Yen, C., Tan, C., Madden, S.R.: Osprey: implementing MapReduce-style fault tolerance in a shared-nothing distributed database. In: Proceedings of the IEEE 26th International Conference on Data Engineering, pp. 657–668 (2010)

  3. Kaldewey, T., Shekita, E.J., Tata, S.: Clydesdale: structured data processing on MapReduce. In: Proceedings of the 15th International Conference on Extending Database Technology, New York, NY, USA, pp. 15–25 (2012)

Author information

Corresponding author

Correspondence to Preeti Mulay.

Annexure

How to Install Spark with Cassandra

The following steps describe how to set up a server with both a Spark node and a Cassandra node (Spark and Cassandra will both be running on localhost). There are two ways to set up a Spark and Cassandra server: if you have DataStax Enterprise [3], you can simply install an Analytics Node and check the box for Spark; if you are using the open-source version, you will need to follow the steps below.

These steps assume that you already have Cassandra set up.

  1. Download and set up Spark

    i. Go to http://spark.apache.org/downloads.html.

    ii. Choose Spark version 2.2.0 and “Pre-built for Hadoop 2.4”, then Direct Download. This downloads an archive containing the pre-built Spark binaries.

    iii. Extract the archive to a directory of your choosing, e.g. ~/apps/spark-1.2

    iv. Test that Spark is working by opening the shell, as described in the next step.

  2. Test that Spark works

    i. cd into the Spark directory.

    ii. Run “./bin/spark-shell”. This opens the Spark interactive shell.

    iii. If everything worked, it should display this prompt: “scala>”

    iv. Run a simple calculation, e.g. sc.parallelize(1 to 100).reduce(_ + _), which should return 5050.

    v. Exit the Spark shell with the command “:quit”
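The sanity check in step 2 adds up the numbers 1 to 100 across the partitions of an RDD. As a rough illustration only, the plain-Scala sketch below emulates that behaviour on a local collection without using Spark at all: the input is split into chunks (standing in for partitions), each chunk is summed, and the partial sums are combined. The object name `PartitionedSumSketch`, the helper `partitionedSum`, and the chunk size of 25 are illustrative assumptions, not Spark settings or APIs.

```scala
// Plain-Scala emulation of a partition-wise sum; no Spark dependency.
object PartitionedSumSketch {
  // Split the input into fixed-size chunks (stand-ins for RDD partitions),
  // sum each chunk independently, then combine the partial sums.
  // chunkSize must be positive; an empty input yields 0.
  def partitionedSum(xs: Seq[Int], chunkSize: Int): Int =
    xs.grouped(chunkSize).map(_.sum).sum

  def main(args: Array[String]): Unit = {
    val total = partitionedSum(1 to 100, chunkSize = 25)
    println(total) // 5050, the same value as (1 to 100).sum
  }
}
```

Because addition is associative and commutative, the result does not depend on how the numbers are chunked, which is why the same kind of operation can safely be distributed over partitions.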

The Spark Cassandra Connector

To connect Spark to a Cassandra cluster, the Cassandra Connector needs to be added to the Spark project. DataStax publishes its Spark Cassandra Connector on GitHub, and that is the one used here.

  1. Clone the Spark Cassandra Connector repository: https://github.com/datastax/spark-cassandra-connector

  2. cd into “spark-cassandra-connector”

  3. Build the Spark Cassandra Connector

    i. Execute the command “./sbt/sbt assembly”

    ii. This should output compiled jar files to the directory named “target”. There will be two jar files, one for Scala and one for Java.

    iii. The jar we are interested in is the one for Scala, “spark-cassandra-connector-assembly-1.1.1-SNAPSHOT.jar”.

  4. Move the jar file into an easy-to-find directory, e.g. ~/apps/spark-1.2/jars

To Load the Connector into the Spark Shell

  1. From the Spark directory, start the shell with this command:

    ./bin/spark-shell --jars ~/apps/spark-1.2/jars/spark-cassandra-connector-assembly-1.1.1-SNAPSHOT.jar

  2. Connect the Spark Context to the Cassandra cluster:

    i. Stop the default context: sc.stop

    ii. Import the necessary classes: import com.datastax.spark.connector._, import org.apache.spark.SparkContext, import org.apache.spark.SparkContext._, import org.apache.spark.SparkConf

    iii. Make a new SparkConf with the Cassandra connection details: val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")

    iv. Create a new Spark Context: val sc = new SparkContext(conf)

  3. You now have a new SparkContext that is connected to your Cassandra cluster.

  4. From the Spark shell, run the following commands:

    i. val test_spark_rdd = sc.cassandraTable("test_spark", "test")

    ii. test_spark_rdd.first

    iii. The expected output, the first row of the table, is then displayed.
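The shell session above can also be collected into a small standalone program. The following is a minimal sketch under stated assumptions rather than the chapter's own code: it assumes the Spark Cassandra Connector jar is on the classpath, that Cassandra is reachable on localhost, and that the keyspace "test_spark" and table "test" already exist and contain at least one row. The object name `CassandraReadSketch`, the helper `buildConf`, the application name, and the `local[*]` master are hypothetical choices made for running outside the Spark shell.

```scala
import com.datastax.spark.connector._ // adds cassandraTable to SparkContext
import org.apache.spark.{SparkConf, SparkContext}

object CassandraReadSketch {
  // Build a configuration that points Spark at a Cassandra node.
  // The app name and local master are assumptions for a standalone run;
  // inside spark-shell these are already supplied by the shell.
  def buildConf(host: String): SparkConf =
    new SparkConf(true)
      .setAppName("cassandra-read-sketch")
      .setMaster("local[*]")
      .set("spark.cassandra.connection.host", host)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(buildConf("localhost"))
    try {
      // Assumes keyspace "test_spark" and table "test" exist and are not empty;
      // first throws an exception on an empty table.
      val testSparkRdd = sc.cassandraTable("test_spark", "test")
      println(testSparkRdd.first)
    } finally {
      sc.stop() // release the context even if the read fails
    }
  }
}
```

Keeping the configuration in its own function makes it easy to point the same program at a different Cassandra host without touching the read logic.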

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Chaudhari, A.A., Mulay, P. (2019). SCSI: Real-Time Data Analysis with Cassandra and Spark. In: Mittal, M., Balas, V., Goyal, L., Kumar, R. (eds) Big Data Processing Using Spark in Cloud. Studies in Big Data, vol 43. Springer, Singapore. https://doi.org/10.1007/978-981-13-0550-4_11

  • DOI: https://doi.org/10.1007/978-981-13-0550-4_11

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-0549-8

  • Online ISBN: 978-981-13-0550-4

  • eBook Packages: Engineering, Engineering (R0)
