Abstract
This paper surveys the requirements for handling large volumes of heterogeneous data and presents an overview of the computing techniques and technologies used to analyze and process such datasets. To picture how data heterogeneity meets systems heterogeneity, we summarize the key issues identified across multiple dimensions, including data, processing, workload, and infrastructure.
Copyright information
© 2021 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Stan, RG., Negru, C., Bajenaru, L., Pop, F. (2021). Data and Systems Heterogeneity: Analysis on Data, Processing, Workload, and Infrastructure. In: Pop, F., Neagu, G. (eds) Big Data Platforms and Applications. Computer Communications and Networks. Springer, Cham. https://doi.org/10.1007/978-3-030-38836-2_4
DOI: https://doi.org/10.1007/978-3-030-38836-2_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-38835-5
Online ISBN: 978-3-030-38836-2
eBook Packages: Computer Science, Computer Science (R0)