Abstract
The term big data is now well understood in terms of its well-defined characteristics, and its usage looks increasingly promising. This introductory chapter draws a comprehensive picture of the progress of big data. It first defines the characteristics of big data and then presents its usage in different domains. The challenges of processing big data, along with guidelines for doing so, are outlined. The state of the art in hardware and software technologies required for big data processing is discussed, followed by a brief discussion of the tools currently available. Finally, research issues in big data are identified. The references surveyed for this chapter, which introduce different facets of this emergent area of data science, give interested readers a starting point for pursuing the subject further.
References
Zikopoulos, P.C., Eaton, C., deRoos, D., Deutsch, T., Lapis, G.: Understanding Big Data. McGraw-Hill, New York (2012)
García, A.O., Bourov, S., Hammad, A., Hartmann, V., Jejkal, T., Otte, J.C., Pfeiffer, S., Schenker, T., Schmidt, C., Neuberger, P., Stotzka, R., van Wezel, J., Neumair, B., Streit, A.: Data-intensive analysis for scientific experiments at the large scale data facility. In: IEEE Symposium on Large Data Analysis and Visualization (LDAV), pp. 125–126 (2011)
O’Leary, D.E.: Artificial intelligence and big data. IEEE Intell. Syst. 28, 96–99 (2013)
Berman, J.J.: Introduction. In: Principles of Big Data, pp. xix-xxvi. Morgan Kaufmann, Boston (2013)
Chen, M., Mao, S., Liu, Y.: Big data: a survey. Mob. Netw. Appl. 19, 171–209 (2014)
Hashem, I.A.T., Yaqoob, I., Anuar, N.B., Mokhtar, S., Gani, A., Khan, S.U.: The rise of “Big Data” on cloud computing: review and open research issues. Inf. Syst. 47, 98–115 (2015)
Lusch, R.F., Liu, Y., Chen, Y.: The phase transition of markets and organizations: the new intelligence and entrepreneurial frontier. IEEE Intell. Syst. 25(1), 71–75 (2010)
Chen, H., Chiang, R.H.L., Storey, V.C.: Business intelligence and analytics: from big data to big impact. MIS Quarterly 36(4), 1165–1188 (2012)
Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Trans. Knowl. Data Eng. 17(6), 734–749 (2005)
Chen, H.: Smart health and wellbeing. IEEE Intell. Syst. 26(5), 78–79 (2011)
Parida, L., Haiminen, N., Haws, D., Suchodolski, J.: Host trait prediction of metagenomic data for topology-based visualisation. LNCS 5956, 134–149 (2015)
Chen, H.: Dark Web: Exploring and Mining the Dark Side of the Web. Springer, New York (2012)
NSF: Program Solicitation NSF 12-499: Core techniques and technologies for advancing big data science & engineering (BIGDATA). http://www.nsf.gov/pubs/2012/nsf12499/nsf12499.htm (2012). Accessed 12th Feb 2015
Salton, G.: Automatic Text Processing. Addison-Wesley, Reading, MA (1989)
Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge (1999)
Big Data Spectrum, Infosys. http://www.infosys.com/cloud/resource-center/Documents/big-data-spectrum.pdf
Short, E., Bohn, R.E., Baru, C.: How much information? 2010 report on enterprise server information. UCSD Global Information Industry Center (2011)
http://agbeat.com/tech-news/how-carriers-gather-track-and-sell-your-private-data/
Rahm, E., Do, H.H.: Data cleaning: problems and current approaches. IEEE Data Eng. Bull. 23, 3–13 (2000)
Agrawal, D., Bernstein, P., Bertino, E., Davidson, S., Dayal, U., Franklin, M., Gehrke, J., Haas, L., Han, J., Halevy, A., Jagadish, H.V., Labrinidis, A., Madden, S., Papakonstantinou, Y., Patel, J., Ramakrishnan, R., Ross, K., Cyrus, S., Suciu, D., Vaithyanathan, S., Widom, J.: Challenges and opportunities with big data. Cyber Center Technical Reports, Purdue University (2011)
Kasavajhala, V.: Solid state drive vs. hard disk drive price and performance study. In: Dell PowerVault Tech. Mark (2012)
Hutchinson, L.: Solid-state revolution: in-depth on how SSDs really work. Ars Technica (2012)
Pirovano, A., Lacaita, A.L., Benvenuti, A., Pellizzer, F., Hudgens, S., Bez, R.: Scaling analysis of phase-change memory technology. IEEE Int. Electron Dev. Meeting, 29.6.1–29.6.4 (2003)
Chen, S., Gibbons, P.B., Nath, S.: Rethinking database algorithms for phase change memory. In: CIDR, pp. 21–31. www.cidrdb.org (2011)
Venkataraman, S., Tolia, N., Ranganathan, P., Campbell, R.H.: Consistent and durable data structures for non-volatile byte-addressable memory. In: Ganger, G.R., Wilkes, J. (eds.) FAST, pp. 61–75. USENIX (2011)
Athanassoulis, M., Ailamaki, A., Chen, S., Gibbons, P., Stoica, R.: Flash in a DBMS: where and how? IEEE Data Eng. Bull. 33(4), 28–34 (2010)
Condit, J., Nightingale, E.B., Frost, C., Ipek, E., Lee, B.C., Burger, D., Coetzee, D.: Better I/O through byte-addressable, persistent memory. In: Proceedings of the 22nd Symposium on Operating Systems Principles (22nd SOSP’09), Operating Systems Review (OSR), pp. 133–146, ACM SIGOPS, Big Sky, MT (2009)
Wang, Q., Ren, K., Lou, W., Zhang, Y.: Dependable and secure sensor data storage with dynamic integrity assurance. In: Proceedings of the IEEE INFOCOM, pp. 954–962 (2009)
Oprea, A., Reiter, M.K., Yang, K.: Space efficient block storage integrity. In: Proceeding of the 12th Annual Network and Distributed System Security Symposium (NDSS 05) (2005)
Hashem, I.A.T., Yaqoob, I., Anuar, N.B., Mokhtar, S., Gani, A., Khan, S.U.: The rise of “big data” on cloud computing: review and open research issues, vol. 47, pp. 98–115 (2015)
Wang, Q., Wang, C., Ren, K., Lou, W., Li, J.: Enabling public auditability and data dynamics for storage security in cloud computing. IEEE Trans. Parallel Distrib. Syst. 22(5), 847–859 (2011)
Oehmen, C., Nieplocha, J.: Scalablast: a scalable implementation of blast for high-performance data-intensive bioinformatics analysis. IEEE Trans. Parallel Distrib. Syst. 17(8), 740–749 (2006)
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., Hung Byers, A.: Big data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute (2012)
Chen, C.L.P., Zhang, C.-Y.: Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf. Sci. 275, 314–347 (2014)
Marz, N., Warren, J.: Big data: principles and best practices of scalable real-time data systems. Manning (2012)
Garber, L.: Using in-memory analytics to quickly crunch big data. IEEE Comput. Soc. 45(10), 16–18 (2012)
Molinari, C.: No one size fits all strategy for big data, Says IBM. http://www.bnamericas.com/news/technology/no-one-size-fits-all-strategy-for-big-data-says-ibm, October 2012
Ferguson, M.: Architecting a big data platform for analytics, Intelligent Business Strategies. https://www.ndm.net/datawarehouse/pdf/Netezza (2012). Accessed 19th Feb 2015
Ranganathan, P., Chang, J.: (Re)designing data-centric data centers. IEEE Micro 32(1), 66–70 (2012)
Iyer, R., Illikkal, R., Zhao, L., Makineni, S., Newell, D., Moses, J., Apparao, P.: Datacenter-on-chip architectures: tera-scale opportunities and challenges. Intel Tech. J. 11(3), 227–238 (2007)
Tang, J., Liu, S., Gu, Z., Li, X.-F., Gaudiot, J.-L.: Achieving middleware execution efficiency: hardware-assisted garbage collection operations. J. Supercomput. 59(3), 1101–1119 (2012)
Made in IBM labs: holey optochip first to transfer one trillion bits of information per second using the power of light, 2012. http://www-03.ibm.com/press/us/en/pressrelease/37095.wss
Farrington, N., Porter, G., Radhakrishnan, S., Bazzaz, H.H., Subramanya, V., Fainman, Y., Papen, G., Vahdat, A.: Helios: a hybrid electrical/optical switch architecture for modular data centers. In: Kalyanaraman, S., Padmanabhan, V.N., Ramakrishnan, K.K., Shorey, R., Voelker, G.M. (eds.) SIGCOMM, pp. 339–350. ACM (2010)
Popek, G.J., Goldberg, R.P.: Formal requirements for virtualizable third generation architectures. Commun. ACM 17(7), 412–421 (1974)
Andersen, R., Vinter, B.: The scientific byte code virtual machine. In: GCA, pp. 175–181 (2008)
Kambatla, K., Kollias, G., Kumar, V., Grama, A.: Trends in big data analytics. J. Parallel Distrib. Comput. 74, 2561–2573 (2014)
Brewer, E.A.: Towards robust distributed systems. In: Proceeding of 19th Annual ACM Symposium on Principles of Distributed Computing (PODC), pp. 7–10 (2000)
DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., Vogels, W.: Dynamo: Amazon’s highly available key-value store. In: Proceedings of Twenty-First ACM SIGOPS Symposium on Operating Systems Principles, SOSP’07, ACM, New York, NY, USA, pp. 205–220 (2007)
Lakshman, A., Malik, P.: Cassandra: a structured storage system on a p2p network. In: SPAA (2009)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: OSDI (2004)
Apache yarn. http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/YARN.html
Hortonworks blog. http://hortonworks.com/blog/executive-video-series-the-hortonworks-vision-for-apache-hadoop
Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M., Elmeleegy, K., Sears, R.: MapReduce online. In: NSDI’10 Proceedings of the 7th USENIX conference on Networked systems design and implementation, p. 21
Kambatla, K., Rapolu, N., Jagannathan, S., Grama, A.: Asynchronous algorithms in MapReduce. In: IEEE International Conference on Cluster Computing, CLUSTER (2010)
Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G., Kozyrakis, C.: Evaluating mapreduce for multi-core and multiprocessor system. In: Proceedings of the 13th International Symposium on High-Performance Computer Architecture (HPCA), Phoenix, AZ (2007)
Improving MapReduce Performance in Heterogeneous Environments. USENIX Association, San Diego, CA (2008), 12/2008
Polato, I., Ré, R., Goldman, A., Kon, F.: A comprehensive view of Hadoop research—a systematic literature review. J. Netw. Comput. Appl. 46, 1–25 (2014)
Valiant, L.G.: A bridging model for parallel computation. Commun. ACM 33(8), 103–111 (1990)
Malewicz, G., Austern, M.H., Bik, A.J.C., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: SIGMOD (2010)
Phoebus. https://github.com/xslogic/phoebus
Ahmad, Y., Berg, B., Cetintemel, U., Humphrey, M., Hwang, J.-H., Jhingran, A., Maskey, A., Papaemmanouil, O., Rasin, A., Tatbul, N., Xing, W., Xing, Y., Zdonik, S.: Distributed operation in the borealis stream processing engine. In: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD ‘05, pp. 882–884, ACM, New York, NY, USA (2005)
Andrade, H., Gedik, B., Wu, K.L., Yu, P.S.: Processing high data rate streams in system S. J. Parallel Distrib. Comput. 71(2), 145–156 (2011)
Power, R., Li, J.: Piccolo: building fast, distributed programs with partitioned tables. In: OSDI (2010)
Rapolu, N., Kambatla, K., Jagannathan, S., Grama, A.: TransMR: data-centric programming beyond data parallelism. In: Proceedings of the 3rd USENIX Conference on Hot Topics in Cloud Computing, HotCloud’11, USENIX Association, Berkeley, CA, USA, pp. 19–19 (2011)
Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: EuroSys ’07 Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, vol. 41, no. 3, pp. 59–72 (2007)
Wayner, P.: 7 top tools for taming big data. http://www.networkworld.com/reviews/2012/041812-7-top-tools-for-taming-258398.html (2012)
Pentaho Business Analytics. 2012. http://www.pentaho.com/explore/pentaho-business-analytics/
Samuels, D.: Skytree: machine learning meets big data. http://www.bizjournals.com/sanjose/blog/2012/02/skytree-machine-learning-meets-big-data.html?page=all, February 2012
Brooks, J.: Review: Talend open studio makes quick work of large data sets. http://www.eweek.com/c/a/Database/REVIEW-Talend-Open-Studio-Makes-Quick-ETL-Work-of-Large-Data-Sets-281473/ (2009)
Karmasphere Studio and Analyst. http://www.karmasphere.com/ (2012)
IBM Infosphere. http://www-01.ibm.com/software/in/data/infosphere/
Auradkar, A., Botev, C., Das, S., De Maagd, D., Feinberg, A., Ganti, P., Ghosh, B., Gao, L., Gopalakrishna, K., Harris, B., Koshy, J., Krawez, K., Kreps, J., Lu, S., Nagaraj, S., Narkhede, N., Pachev, S., Perisic, I., Qiao, L., Quiggle, T., Rao, J., Schulman, B., Sebastian, A., Seeliger, O., Silberstein, A., Shkolnik, B., Soman, C., Sumbaly, R., Surlaker, K., Topiwala, S., Tran, C., Varadarajan, B., Westerman, J., White, Z., Zhang, D., Zhang, J.: Data infrastructure at linkedin. In: 2012 IEEE 28th International Conference on Data Engineering (ICDE), pp. 1370–1381 (2012)
Kraft, S., Casale, G., Jula, A., Kilpatrick, P., Greer, D.: Wiq: work-intensive query scheduling for in-memory database systems. In: 2012 IEEE 5th International Conference on Cloud Computing (CLOUD), pp. 33–40 (2012)
Samson, T.: Splunk storm brings log management to the cloud. http://www.infoworld.com/t/managed-services/splunk-storm-brings-log-management-the-cloud-201098?source=footer (2012)
Storm. http://storm-project.net/ (2012)
Sqlstream. http://www.sqlstream.com/products/server/ (2012)
Neumeyer, L., Robbins, B., Nair, A., Kesari, A.: S4: distributed stream computing platform. In: 2010 IEEE Data Mining Workshops (ICDMW), pp. 170–177, Sydney, Australia (2010)
Kelly, J.: Apache drill brings SQL-like, ad hoc query capabilities to big data. http://wikibon.org/wiki/v/Apache-Drill-Brings-SQL-Like-Ad-Hoc-Query-Capabilities-to-Big-Data, February 2013
Melnik, S., Gubarev, A., Long, J.J., Romer, G., Shivakumar, S., Tolton, M., Vassilakis, T.: Dremel: interactive analysis of webscale datasets. In: Proceedings of the 36th International Conference on Very Large Data Bases (2010), vol. 3(1), pp. 330–339 (2010)
Li, X., Yao, X.: Cooperatively coevolving particle swarms for large scale optimization. IEEE Trans. Evol. Comput. 16(2), 210–224 (2008)
Yang, Z., Tang, K., Yao, X.: Large scale evolutionary optimization using cooperative coevolution. Inf. Sci. 178(15), 2985–2999 (2008)
Yan, J., Liu, N., Yan, S., Yang, Q., Fan, W., Wei, W., Chen, Z.: Trace-oriented feature analysis for large-scale text data dimension reduction. IEEE Trans. Knowl. Data Eng. 23(7), 1103–1117 (2011)
Spiliopoulou, M., Hatzopoulos, M., Cotronis, Y.: Parallel optimization of large join queries with set operators and aggregates in a parallel environment supporting pipeline. IEEE Trans. Knowl. Data Eng. 8(3), 429–445 (1996)
Di Ciaccio, A., Coli, M., Ibanez, A., Miguel, J.: Advanced Statistical Methods for the Analysis of Large Data-Sets. Springer, Berlin (2012)
Pébay, P., Thompson, D., Bennett, J., Mascarenhas, A.: Design and performance of a scalable, parallel statistics toolkit. In: 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), pp. 1475–1484 (2011)
Klemens, B.: Modeling with Data: Tools and Techniques for Statistical Computing. Princeton University Press, New Jersey (2008)
Wilkinson, L.: The future of statistical computing. Technometrics 50(4), 418–435 (2008)
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer, Berlin (2009)
Jamali, M., Abolhassani, H.: Different aspects of social network analysis. In: IEEE/WIC/ACM International Conference on Web Intelligence, WI 2006, pp. 66–72 (2006)
Zhang, Y., van der Schaar, M.: Information production and link formation in social computing systems. IEEE J. Sel. Areas Commun. 30(1), 2136–2145 (2012)
Bringmann, B., Berlingerio, M., Bonchi, F., Gionis, A.: Learning and predicting the evolution of social networks. IEEE Intell. Syst. 25(4), 26–35 (2010)
Fekete, J.-D., Henry, N., McGuffin, M.: Nodetrix: a hybrid visualization of social network. IEEE Trans. Visual. Comput. Graph. 13(6), 1302–1309 (2007)
Shen, Z., Ma, K.-L., Eliassi-Rad, T.: Visual analysis of large heterogeneous social networks by semantic and structural abstraction. IEEE Trans. Visual. Comput. Graph. 12(6), 1427–1439 (2006)
Lin, C.-Y., Lynn, W., Wen, Z., Tong, H., Griffiths-Fisher, V., Shi, L., Lubensky, D.: Social network analysis in enterprise. Proc. IEEE 100(9), 2759–2776 (2012)
Ma, H., King, I., Lyu, M.R.-T.: Mining web graphs for recommendations. IEEE Trans. Knowl. Data Eng. 24(12), 1051–1064 (2012)
Lane, N.D., Ye, X., Hong, L., Campbell, A.T., Choudhury, T., Eisenman, S.B.: Exploiting social networks for large-scale human behavior modeling. IEEE Pervasive Comput. 10(4), 45–53 (2011)
Bengio, Y.: Learning deep architectures for AI. Found. Trends Mach. Learn. 2(1), 1–127 (2009)
Seiffert, U.: Training of large-scale feed-forward neural networks. In: International Joint Conference on Neural Networks, IJCNN ‘06, pp. 5324–5329 (2006)
Arel, I., Rose, D.C., Karnowski, T.P.: Deep machine learning—a new frontier in artificial intelligence research. IEEE Comput. Intell. Mag. 5(4), 13–18 (2010)
Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)
Le, Q.V., Ranzato, M.A., Monga, R., Devin, M., Chen, K., Corrado, G.S., Dean, J., Ng, A.Y.: Building high-level features using large scale unsupervised learning. In: Proceedings of the 29th International Conference on Machine Learning (2012)
Yu, D., Deng, L.: Deep learning and its applications to signal and information processing. IEEE Signal Process. Mag. 28(1), 145–154 (2011)
Ciresan, D., Meier, U., Schmidhuber, J.: Multi-column deep neural networks for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition (2012)
Simoff, S., Böhlen, M.H., Mazeika, A.: Visual Data Mining: Theory, Techniques and Tools for Visual Analytics. Springer, Berlin (2008)
Thompson, D., Levine, J.A., Bennett, J.C., Bremer, P.T., Gyulassy, A., Pascucci, V., Pébay, P.P.: Analysis of large-scale scalar data using hixels. In: 2011 IEEE Symposium on Large Data Analysis and Visualization (LDAV), pp. 23–30 (2011)
Andrzej, W.P., Kreinovich, V.: Handbook of Granular Computing. Wiley, New York (2008)
Peters, G.: Granular box regression. IEEE Trans. Fuzzy Syst. 19(6), 1141–1151 (2011)
Su, S.-F., Chuang, C.-C., Tao, C.W., Jeng, J.-T., Hsiao, C.-C.: Radial basis function networks with linear interval regression weights for symbolic interval data. IEEE Trans. Syst. Man Cyber.–Part B Cyber. 19(6), 1141–1151 (2011)
Simon, D.R.: On the power of quantum computation. SIAM J. Comput. 26, 116–123 (1994)
Lloyd, S.P.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)
Nielsen, M.A., Chuang, I.L.: Quantum Computation and Quantum Information. Cambridge University Press, Cambridge (2009)
Furht, B., Escalante, A.: Handbook of Cloud Computing. Springer, Berlin (2011)
Schadt, E.E., Linderman, M.D., Sorenson, J., Lee, L., Nolan, G.P.: Computational solutions to large-scale data management and analysis. Nat. Rev. Genet. 11(9), 647–657 (2010)
Sipper, M., Sanchez, E., Mange, D., Tomassini, M., Pérez-Uribe, A., Stauffer, A.: A phylogenetic, ontogenetic, and epigenetic view of bio-inspired hardware systems. IEEE Trans. Evol. Comput. 1(1), 83–97 (1997)
Bongard, J.: Biologically inspired computing. Computer 42(4), 95–98 (2009)
Ratner, M., Ratner, D.: Nanotechnology: A Gentle Introduction to the Next Big Idea, 1st edn. Prentice Hall Press, Upper Saddle River (2002)
Weiss, R., Basu, S., Hooshangi, S., Kalmbach, A., Karig, D., Mehreja, R., Netravali, I.: Genetic circuit building blocks for cellular computation, communications, and signal processing. Nat. Comput. 2, 47–84 (2003)
Wang, L., Shen, J.: Towards bio-inspired cost minimisation for data-intensive service provision. In: 2012 IEEE First International Conference on Services Economics (SE), pp. 16–23 (2012)
Exercise
1. Define big data. Explain with an example.
2. List the possible sources generating big data.
3. Discuss the usage of big data in different domains.
4. Why is it called “big data as a service”? Justify your answer.
5. What makes big data processing difficult?
6. Discuss the guidelines for big data processing.
7. Draw an ecosystem for a big data system. Explain the functionality of each component.
8. Discuss the hardware and software technologies required for big data processing.
9. Make a list of big data tools and note their functionality.
10. Discuss the trends in big data research.
Copyright information
© 2015 Springer India
About this chapter
Cite this chapter
Mohanty, H. (2015). Big Data: An Introduction. In: Mohanty, H., Bhuyan, P., Chenthati, D. (eds) Big Data. Studies in Big Data, vol 11. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2494-5_1
DOI: https://doi.org/10.1007/978-81-322-2494-5_1
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-2493-8
Online ISBN: 978-81-322-2494-5
eBook Packages: Engineering, Engineering (R0)