A Systematic Review of Challenges, Tools, and Myths of Big Data Ingestion

  • Conference paper
Data Science and Security

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 462)

Abstract

Every sector of the digital world generates enormous volumes of data as human life continues to transform. Areas such as data analytics, data science, knowledge discovery in databases (KDD), machine learning, and artificial intelligence depend on highly distributed data, which requires appropriate storage in a data lake. Collecting data from heterogeneous sources and consolidating it into a single data lake is called data ingestion. Ironically, data ingestion has been treated as a less important stage of data analysis because it is considered a minor first step. Several misconceptions about data ingestion persist in the data and analytics domain. The survey conducted in this research presents a list of significant challenges that information technology (IT) industries face during data ingestion. The available frameworks are compared on standard parameters set against the existing challenges and myths. The findings from the comparison are compiled in tabular format for easy reference. The paper emphasizes the significance of data ingestion and attempts to present it as a major activity on the big data platform.

Corresponding author

Correspondence to Mohammad Irfan.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Irfan, M., George, J.P. (2022). A Systematic Review of Challenges, Tools, and Myths of Big Data Ingestion. In: Shukla, S., Gao, XZ., Kureethara, J.V., Mishra, D. (eds) Data Science and Security. Lecture Notes in Networks and Systems, vol 462. Springer, Singapore. https://doi.org/10.1007/978-981-19-2211-4_43
