Performance Evaluation of Spark SQL for Batch Processing

  • Conference paper
Emerging Research in Data Engineering Systems and Computer Communications

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1054)

Abstract

Nowadays, organizations generate large amounts of data, and many of them handle Big Data, with its high volume, velocity, and variety, inefficiently. Although data is a valuable resource, organizing Big Data remains a major challenge. A number of companies have adopted NoSQL databases such as Cassandra, MongoDB, and HBase, which can handle a large number of requests at a time. For processing Big Data, Apache Spark, one of the most powerful processing engines, offers a number of benefits. The main programming abstraction in Apache Spark is the Resilient Distributed Dataset (RDD), which supports only procedural processing. However, the most common data processing paradigm is the relational query, which RDDs cannot express directly. To overcome this, several higher-level libraries built on Apache Spark are needed. Spark SQL is a component of the Apache Spark framework that integrates relational processing with Apache Spark's functional programming API. It lets Apache Spark programmers use the benefits of relational processing, and it combines relational and procedural processing through a declarative DataFrame API. In this study, we therefore use Spark SQL DataFrames to improve the processing of weather data stored in a Cassandra database. The study shows that Spark SQL DataFrames outperform the Spark Core RDD approach that we evaluated earlier.
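
The contrast between procedural and relational processing described in the abstract can be illustrated with a small stand-in. The sketch below is not Spark code and is not taken from the paper: it uses plain Python and the standard-library `sqlite3` module as an analogy, and the sample readings, table name, and function names are all hypothetical. The first function spells out how to aggregate, much as RDD-style code would; the second only states the desired result as a query, much as a DataFrame or SQL expression would.

```python
import sqlite3

# Hypothetical weather readings: (station id, date, temperature in Celsius).
# These values are made up for illustration; they are not the paper's data.
READINGS = [
    ("S1", "2015-01-01", 4.0),
    ("S1", "2015-01-02", 6.0),
    ("S2", "2015-01-01", 10.0),
    ("S2", "2015-01-02", 14.0),
    ("S2", "2015-01-03", 12.0),
]


def mean_temp_procedural(readings):
    """Procedural style: the program spells out each aggregation step."""
    totals = {}
    for station, _day, temp in readings:
        total, count = totals.get(station, (0.0, 0))
        totals[station] = (total + temp, count + 1)
    return {station: total / count for station, (total, count) in totals.items()}


def mean_temp_relational(readings):
    """Relational style: the query states the result; the engine plans the work."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE weather (station TEXT, day TEXT, temp REAL)")
        conn.executemany("INSERT INTO weather VALUES (?, ?, ?)", readings)
        rows = conn.execute(
            "SELECT station, AVG(temp) FROM weather GROUP BY station"
        ).fetchall()
    finally:
        conn.close()
    return dict(rows)


procedural = mean_temp_procedural(READINGS)
relational = mean_temp_relational(READINGS)
print(procedural)
print(relational)
```

Both functions return the same per-station means. In the relational version the query engine is free to choose how the grouping is executed, which is the kind of freedom a declarative API gives an optimizer.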

Author information

Corresponding author

Correspondence to K. Anusha.

Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Anusha, K., Usha Rani, K. (2020). Performance Evaluation of Spark SQL for Batch Processing. In: Venkata Krishna, P., Obaidat, M. (eds) Emerging Research in Data Engineering Systems and Computer Communications. Advances in Intelligent Systems and Computing, vol 1054. Springer, Singapore. https://doi.org/10.1007/978-981-15-0135-7_13
