Real-Time Big Data Stream Processing Using GPU with Spark Over Hadoop Ecosystem

International Journal of Parallel Programming

Abstract

In this technological era, people, authorities, entrepreneurs, businesses, and many of the things around us are connected to the internet, forming the Internet of Things (IoT). This generates a massive amount of diverse data at very high speed, termed Big Data. This data is a valuable asset that businesses, organizations, and authorities can use to predict the future in various respects. However, processing Big Data efficiently while making real-time decisions is a challenging task. Tools such as Hadoop are used for processing Big Data sets, but they do not perform well for real-time, high-speed stream processing. Therefore, in this paper, we propose an efficient, real-time Big Data stream processing approach that maps a Hadoop MapReduce-equivalent mechanism onto graphics processing units (GPUs). We integrate the parallel and distributed environment of the Hadoop ecosystem and a real-time stream processing tool, Spark, with GPUs so that the system can handle the overwhelming volume of high-speed streaming data. We design a MapReduce-equivalent algorithm for GPUs that calculates statistical parameters by dividing the overall Big Data files into fixed-size blocks. Finally, the system is evaluated for efficiency (processing time and throughput) using (1) large city traffic video data captured by the cameras of static as well as moving vehicles, with vehicle identification as the task, and (2) large text-based files, such as Twitter data files and structural data. Results show that the proposed system, with Spark on top and GPUs underneath in the parallel and distributed environment of the Hadoop ecosystem, is more efficient and better suited to real-time processing than the existing standalone CPU-based MapReduce implementation.
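The block-wise statistical calculation described in the abstract can be illustrated with a minimal CPU-only sketch. This is not the authors' GPU algorithm: it only shows the general map/reduce pattern of splitting data into fixed-size blocks, computing a partial result per block, and combining the partials. All function names, the choice of mean and variance as the statistical parameters, and the block size are assumptions made for illustration. On a GPU, each block would typically be handled by a separate group of threads.

```python
from functools import reduce
from typing import List, Tuple

# Partial result for one block: (count, sum, sum of squares).
Partial = Tuple[int, float, float]


def split_into_blocks(values: List[float], block_size: int) -> List[List[float]]:
    """Divide the input into fixed-size blocks (the last block may be shorter)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [values[i:i + block_size] for i in range(0, len(values), block_size)]


def map_block(block: List[float]) -> Partial:
    """Map step: compute the partial statistics of a single block."""
    return (len(block), sum(block), sum(x * x for x in block))


def combine(a: Partial, b: Partial) -> Partial:
    """Reduce step: merge two partial results by element-wise addition."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def block_statistics(values: List[float], block_size: int) -> Tuple[float, float]:
    """Return (mean, population variance) computed block by block."""
    partials = [map_block(b) for b in split_into_blocks(values, block_size)]
    count, total, total_sq = reduce(combine, partials, (0, 0.0, 0.0))
    if count == 0:
        raise ValueError("no data to process")
    mean = total / count
    variance = total_sq / count - mean * mean
    return mean, variance
```

Because the reduce step is an associative addition of partial results, the blocks can be processed in any order or in parallel, which is the property that makes this kind of calculation a candidate for GPU execution. Note that the sum-of-squares formula for variance can lose precision on large or ill-conditioned inputs; it is used here only for brevity.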



Acknowledgements

This study was supported by the Brain Korea 21 Plus project (SW Human Resource Development Program for Supporting Smart Life) funded by the Ministry of Education, School of Computer Science and Engineering, Kyungpook National University, Korea (21A20131600005).

Author information

Corresponding author

Correspondence to Anand Paul.

About this article

Cite this article

Rathore, M.M., Son, H., Ahmad, A. et al. Real-Time Big Data Stream Processing Using GPU with Spark Over Hadoop Ecosystem. Int J Parallel Prog 46, 630–646 (2018). https://doi.org/10.1007/s10766-017-0513-2

  • DOI: https://doi.org/10.1007/s10766-017-0513-2
