Big Data: A Survey

Abstract

In this paper, we review the background and state of the art of big data. We first introduce the general background of big data and review related technologies, such as cloud computing, the Internet of Things, data centers, and Hadoop. We then focus on the four phases of the big data value chain: data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. Finally, we examine several representative applications of big data, including enterprise management, the Internet of Things, online social networks, medical applications, collective intelligence, and the smart grid. These discussions aim to give readers a comprehensive overview and big-picture view of this exciting area. The survey concludes with a discussion of open problems and future directions.

Acknowledgments

This work was supported by China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No.: S2014GAT014), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao’s research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

Author information

Corresponding author

Correspondence to Min Chen.

About this article

Cite this article

Chen, M., Mao, S. & Liu, Y. Big Data: A Survey. Mobile Netw Appl 19, 171–209 (2014). https://doi.org/10.1007/s11036-013-0489-0

Keywords

  • Big data
  • Cloud computing
  • Internet of things
  • Data center
  • Hadoop
  • Smart grid
  • Big data analysis