The Journal of Supercomputing, Volume 72, Issue 4, pp 1494–1516

Handling big data: research challenges and future directions

Abstract

Today, an enormous amount of data is continuously generated in all walks of life by all kinds of devices and systems. A significant portion of such data is captured, stored, aggregated, and analyzed in a systematic way without losing its “4V” (i.e., volume, velocity, variety, and veracity) characteristics. We review the major drivers of big data today as well as the recent trends and established platforms that offer valuable perspectives on the information stored in large and heterogeneous data sets. Then, we present a classification of some of the most important challenges when handling big data. Based on this classification, we recommend solutions that could address the identified challenges, and we highlight cross-disciplinary research directions that need further investigation.

Keywords

Big data, Data curation, Data cleansing, Data analytics, Privacy, Trust

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. University of Thessaly, Nea Ionia, Greece
  2. University of Kentucky, Lexington, USA
  3. University of Toulouse, Toulouse, France