Abstract
Every sector of the digital world generates enormous volumes of data as human life continues to transform. Areas such as data analytics, data science, knowledge discovery in databases (KDD), machine learning, and artificial intelligence depend on highly distributed data, which requires appropriate storage in a data lake. Collecting data from heterogeneous sources and consolidating it into a single data lake is called data ingestion. Ironically, data ingestion is often treated as a less important stage of data analysis because it is regarded as a minor first step, and several misconceptions about it persist in the data and analytics domain. The survey conducted in this research presents a list of significant challenges faced by the information technology (IT) industry during data ingestion. The available frameworks are compared on standard parameters set against the existing challenges and myths, and the findings are compiled in tabular form for easy reference. The paper emphasizes the significance of data ingestion and attempts to present it as a major activity on the big data platform.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Irfan, M., George, J.P. (2022). A Systematic Review of Challenges, Tools, and Myths of Big Data Ingestion. In: Shukla, S., Gao, XZ., Kureethara, J.V., Mishra, D. (eds) Data Science and Security. Lecture Notes in Networks and Systems, vol 462. Springer, Singapore. https://doi.org/10.1007/978-981-19-2211-4_43
DOI: https://doi.org/10.1007/978-981-19-2211-4_43
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-2210-7
Online ISBN: 978-981-19-2211-4
eBook Packages: Intelligent Technologies and Robotics (R0)