Handling big data: research challenges and future directions

Abstract

Today, an enormous amount of data is generated continuously in all walks of life by all kinds of devices and systems. A significant portion of this data is captured, stored, aggregated, and analyzed systematically without losing its “4V” (i.e., volume, velocity, variety, and veracity) characteristics. We review the major drivers of big data today, as well as the recent trends and established platforms that offer valuable perspectives on the information stored in large and heterogeneous data sets. We then present a classification of some of the most important challenges in handling big data. Based on this classification, we recommend solutions that could address the identified challenges and highlight cross-disciplinary research directions that need further investigation.

Acknowledgments

We express our gratitude to Emna Mezghani for her contributions to this work. The authors thank the anonymous reviewers for their valuable comments and suggestions.

Author information

Corresponding author

Correspondence to S. Zeadally.

About this article

Cite this article

Anagnostopoulos, I., Zeadally, S. & Exposito, E. Handling big data: research challenges and future directions. J Supercomput 72, 1494–1516 (2016). https://doi.org/10.1007/s11227-016-1677-z

Keywords

  • Big data
  • Data curation
  • Data cleansing
  • Data analytics
  • Privacy
  • Trust