Encyclopedia of Wireless Networks

Living Edition
| Editors: Xuemin (Sherman) Shen, Xiaodong Lin, Kuan Zhang

Big Data Networks

  • Shigeng ZhangEmail author
Living reference work entry
DOI: https://doi.org/10.1007/978-3-319-32903-1_102-1

Synonyms

Definition

Big data networks provide a distributed infrastructure for storing, transmitting, and processing a huge volume of data to extract useful information from the data.

Historical Background

The world is generating data much more fast than before. It is predicted that the total data generated in the world doubles every 2 years (Gantz and Reinsel, 2012). The huge volume of data appears in various domains and applications, e.g., large graph or complex graph for social networks (Nie et al., 2017; Strogatz, 2001), protein-protein interaction (PPI) networks (Stelzl et al., 2005) in bioinformatics, huge number of searching requests in searching engine, and the big data generated from Internet of Things (IoT) devices (Sun et al., 2016).

How to efficiently store and process the huge volume of data is an important technique issue. The idea of leveraging the resources from a set of computers to enhance processing capability and to manipulate a large volume of data can be traced back to early distributed computing systems in 1970s and 1980s (Gligor and Shattuck, 1980). High-performance computing (HPC) (Dowd et al., 1998) and grid computing (Chervenak et al., 2000) played important roles in processing large volume of data since 1990s, and they are still important platforms to perform computational-aware tasks nowadays.

In 2006, Amazon released its elastic computer cloud (EC2) platform (Buyya et al., 2008), which provides a platform to aggregate resources from a set of powerless computers to obtain the required capability to process a large volume of data. This is different from previous approaches based on HPC. In the same year, Google released MapReduce (Dean and Ghemawat, 2008) and Bigtable (Chang et al., 2008), which support parallel processing of big data in a cluster containing a large number of computers and managing a huge number of structured data in a cluster. The Spark framework (Zaharia et al., 2010) improves efficiency by using in-memory computation. Moreover, Spark supports iterative processing which are necessary in many machine learning tasks.

Foundations

Big Data VS. Networks

When the volume of data becomes huge, the data cannot be stored and processed on a single computer, and we have built networking systems to store, transmit, and process the big data. First, as the data exceeds the capability of a single computer node, it is necessary to build a network to connect a large number of nodes storing the data and exchange the data between them efficiently. This stimulates the development of data centers. Second, to process the big data, we need new network computing paradigms to process the data quickly and in parallel, such as MapReduce and Spark.

Big Data Storage

The huge volume of data are usually stored in data centers (DCs) (Al-Fares et al., 2010; Bari et al., 2013). A data center (Bari et al., 2013) is an infrastructure that consists of servers, storage and network devices, and auxiliary facilities such as power distribution systems and cooling systems. The communication infrastructure used in a DC is called a data center network (DCN). In order to improve the efficiency for big data storage and support fast processing of big data, current research in data center networks mainly focus on the following aspects:
  • Topology design: The topologies used in DCNs can be classified into two categories: switch-centric topology and server-centric topology. In switch-centric DCN topologies, the network connection and routing functions are performed on switches, while the servers are mainly used for data storage and computation. Typical switch-centric topologies include Fat-Tree and VL2. In contrast, in server-centric topologies, the function of interconnection and routing are performed on servers, while the switches provide only simple crossbar functions. Typical server-centric topologies include Dcell, BCude, and uFdix.

  • TCP optimization: Due to the special features of DCNs that are different from traditional Internet, performance of traditional TCP congestion control protocols might degrade significantly in some cases. One such feature is TCP Incast (Wu et al., 2013): when a client sends data request to many servers simultaneously and these servers return data to the client at the same time. In such a case, the buffer of switches will overflow, which will greatly degrade the throughput of the network. Researchers propose several techniques to address the TCP Incast problem. Another hot research issue in TCP for DCNs is multipath TCP (MPTCP). In topologies like Fat-Tree and Bcube, there are multiple paths between a pair of hosts. In many cases, MPTCP can significantly increase the data throughput in DCNs.

Processing Platforms

Cloud computing (Armbrust et al., 2010) provides powerful computation capability to process the big data stored in data centers. According to the provided services, cloud computing can be classified into three categories:
  • Infrastructure as a Service (IaaS): In this type of service model, the cloud provides a set of highly scalable and automated computer resources to users in the form of bare hardware. The users can install operating system, deploy applications, or develop programmes by themselves. Examples of IaaS include Amazon AWS and Rackspace. The users can exploit the high scalability of IaaS, which allows them to dynamically scale up resources on demand instead of buying hardware outright.

  • Platform as a Service (PaaS): In this type of service model, the cloud provides a platform with components containing certain software for some applications. PaaS allows the customers to develop, run, and manage their applications on the platform, without building and managing low-layer infrastructure issues as in IaaS. PaaS provides a framework on which the users can create their customized applications. Examples of PaaS include Google App Engine and Windows Azure.

  • Software as a Service (Saas): In this type of service model, the cloud delivers applications to users through Internet. In the SaaS model, the user needs not to download and install applications on their individual computers. The customer can focus on maintenance and support of their application, while the detailed technical issues such as data, servers, storage, and middleware are all handled by the software vendors. Examples of SaaS include Google Apps, Dropbox, and Salesforce.

Big Data Processing Framework

MapReduce (Dean and Ghemawat, 2008) is a programming model for processing parallelizable problems across huge volume of data using a cluster of computers. MapReduce can process both unstructured data stored in a filesystem or structured data stored in a database. It can take advantage of the locality of data, processing the data near the place where the data is stored in order to minimize communication overhead. When using MapReduce, the user specifies a map function and a reduce function. The map function takes a key/value pair as input and generates a set of intermediate key/value pairs. The reduce function merges the intermediate values with the same intermediate key.

A MapReduce program usually contains three steps:
  • Map: In the map step, each worker node applies the map function to its local data and writes the output of the map function to a temporary storage. There is a master node that is in charge of partitioning data and send data to different worker nodes for processing, ensuring that only one copy of redundant input data is processed.

  • Shuffle: In the shuffle step, the worker nodes redistribute data based on the output keys, such that all data belonging to one key are located on the same worker node.

  • Reduce: In the reduce step, the worker nodes process each group of output data and merge the intermediate results in parallel.

There are two limitations of MapReduce. First, it cannot handle iterative algorithms, but a large number of data processing algorithms (e.g., many machine learning algorithms) are iterative. Second, its performance is greatly affected by the I/O operations. To overcome the limitations of MapReduce, the Spark (Zaharia et al., 2010) framework is developed. Compared with MapReduce, Spark significantly improves efficiency by doing in-memory computation. It provides caches to support iterative computation and multiple data sharing, which reduces cost in data I/O. It decreases the cost in writing intermediate result to HDFS by using the directed acyclic graph (DAG) model and reduces redundant sorting and I/O operations by using multi-thread pool. Spark supports more flexible APIs than MapReduce and supports more languages.

Machine Learning for Big Data Processing

Machine learning are a powerful technique to extract useful information or knowledge from big data. For example, face recognition based on deep learning has been successful in many security-related applications in recent years. Machine learning-based big data processing usually requires a large volume of labelled training data, with which we can build very powerful models that sometime even outperform human being in some special tasks. Recently, how to design suitable network structure for machine learning models that run on a large number of computing nodes have attracted researcher’s attention. This includes designing optimized network structure to reduce parameter transmission in learning models and adaptive learning model execution framework to reduce execution latency of the learning model.

Big Data Network Requirements

Different from traditional data processing, big data processing faces many challenges. First, besides structured data, in big data applications, most data are only partially structured or even unstructured. Second, the volume is extremely large, in the magnitude of PB or even EB. Third, the relationship among data is complex and unknown, making it difficult to use existing approaches to extract useful information from the data. As pointed out in Al-kahtani (2016), the challenges pose some requirements on big data networks, some of which are listed as below.
  • Network resiliency: The availability of a network is critical to communication of distributed resources. Path diversity and failover can help improve the resiliency of the networks to failures. Path diversity means to utilize different routes rather than a single fixed route between a pair of communicating nodes. Failover means to have the ability to recognize failures and switch over to an alternate path quickly.

  • Network congestion mitigation: When many big data applications launch at the same time, their data flow might work in a burst mode and might cause problems like Incast. This might cause congestion problems and consequently increase response latency. As previously described, network congestion can be mitigated by designing network topology and optimizing over TCP.

  • Network performance consistency: Because the processing time in big data applications is usually dominated by the computational time rather than data transmission delay, big data networks are not affected a lot by network latency. For big data applications, one more important issue is to maintain high synchronicity, which refers to that different jobs in big data applications need to be executed in parallel in order to assist in accurate analysis. In such cases, any large decrease in network performance may trigger failures in the outcome.

  • Network future scalability: As more and more big data applications will be ported to data centers, it is important for the data center architecture to have the ability of scale-up adaptively according to the requirement of applications.

  • Application awareness: Big data network architecture typically use computer cluster environments such that the nodes can build large data sets. Different applications might pose different requirements on the performance of the data center. For example, some applications might need high bandwidth, but others might focus on low latency. If the network aims to support different tenants’ applications with different requirements, it needs to support, differentiate, separate, and process various workloads with different requirements. When designing a big data network, the designer needs to consider all factors including coexistence issues of data flows, processes, and application resources.

Key Applications

Big data networks have been utilized in various domains to process huge volume of data. Some applications that utilize big data networks are listed below.
  • Key protein complexes identification: How to identify the key protein complexes and their function models from a protein-protein interaction (PPI) network are most computation-intensive problems in the bioinformatics research (Stelzl et al., 2005). The high time complexity of traditional algorithms make them unsuitable for large-scale PPI networks. Currently, MapReduce and Spark have been used to process large-scale PPI networks (You et al., 2014; Harnie et al., 2017).

  • Recommendation systems in E-Business: In E-commerce companies such as Amazon and Alibaba, how to predict the preferences of different customers and make correct recommendation for them is a key issue. Amazon makes recommendation decisions with big data, which significantly increases its profit. Amazon uses item-to-item collaborative filtering to obtain accurate recommendation results, which relies big data network to handle billions of customers and products.

  • Community discovery in social networks: Community detection and discovery in large-scale social networks is another typical application that relies on big data network to process huge datasets. Large social media networks usually have a huge number of users, e.g., Facebook has more than 2 billion daily active users by 2017. In order to efficiently detect and discover communities in such large-scale social networks, MapReduce has been used (Guo et al., 2015).

Cross-References

References

  1. Al-Fares M, Radhakrishnan S, Raghavan B, Huang N, Vahdat A (2010) Hedera: dynamic flow scheduling for data center networks. In: Proceeding of NSDI, pp 89–92Google Scholar
  2. Al-kahtani MS (2016) Big data networking: requirements, architecture and issues. Int J Wirel Mobile Netw 8(6):1–15CrossRefGoogle Scholar
  3. Armbrust M, Fox A, Griffith R, Joseph AD, Katz R, Konwinski A, Lee G, Patterson D, Rabkin A, Stoica I et al (2010) A view of cloud computing. Commun ACM 53(4):50–58CrossRefGoogle Scholar
  4. Bari MF, Boutaba R, Esteves R, Granville LZ, Podlesny M, Rabbani MG, Zhang Q, Zhani MF (2013) Data center network virtualization: a survey. IEEE Commun Surv Tutor 15(2):909–928CrossRefGoogle Scholar
  5. Buyya R, Yeo CS, Venugopal S (2008) Market-oriented cloud computing: vision, hype, and reality for delivering it services as computing utilities. In: 10th IEEE international conference on high performance computing and communications, 2008 (HPCC’08). Ieee, pp 5–13Google Scholar
  6. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4CrossRefGoogle Scholar
  7. Chervenak A, Foster I, Kesselman C, Salisbury C, Tuecke S (2000) The data grid: towards an architecture for the distributed management and analysis of large scientific datasets. J Netw Comput Appl 23(3):187–200CrossRefGoogle Scholar
  8. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113CrossRefGoogle Scholar
  9. Dowd K, Severance CR, Loukides M (1998) High performance computing-RISC architectures, optimization and benchmarks. O’Reilly, ParisGoogle Scholar
  10. Gantz J, Reinsel D (2012) The digital universe in 2020: big data, bigger digital shadows, and biggest growth in the far east. IDC iView: IDC Analyze Future 2007(2012):1–16Google Scholar
  11. Gligor VD, Shattuck SH (1980) On deadlock detection in distributed systems. IEEE Trans Softw Eng 6(5): 435–440CrossRefGoogle Scholar
  12. Guo K, Guo W, Chen Y, Qiu Q, Zhang Q (2015) Community discovery by propagating local and global information based on the mapreduce model. Inf Sci 323:73–93MathSciNetCrossRefGoogle Scholar
  13. Harnie D, Saey M, Vapirev AE, Wegner JK, Gedich A, Steijaert M, Ceulemans H, Wuyts R, De Meuter W (2017) Scaling machine learning for target prediction in drug discovery using apache spark. Future Gener Comput Syst 67:409–417CrossRefGoogle Scholar
  14. Nie F, Zhu W, Li X (2017) Unsupervised large graph embedding. In: AAAI, pp 2422–2428Google Scholar
  15. Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner M, Schoenherr A, Koeppen S et al (2005) A human protein-protein interaction network: a resource for annotating the proteome. Cell 122(6):957–968CrossRefGoogle Scholar
  16. Strogatz SH (2001) Exploring complex networks. Nature 410(6825):268CrossRefGoogle Scholar
  17. Sun Y, Song H, Jara AJ, Bie R (2016) Internet of things and big data analytics for smart and connected communities. IEEE Access 4:766–773CrossRefGoogle Scholar
  18. Wu H, Feng Z, Guo C, Zhang Y (2013) Ictcp: incast congestion control for tcp in data-center networks. IEEE/ACM Trans Netw (ToN) 21(2):345–358CrossRefGoogle Scholar
  19. You Z-H, Yu J-Z, Zhu L, Li S, Wen Z-K (2014) A mapreduce based parallel svm for large-scale predicting protein–protein interactions. Neurocomputing 145:37–43CrossRefGoogle Scholar
  20. Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) Spark: cluster computing with working sets. HotCloud 10(10-10):95Google Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. 1.School of Computer Science and EngineeringCentral South UniversityChangshaP. R. China

Section editors and affiliations

  • Song Guo
    • 1
  1. 1.Department of ComputingThe Hong Kong Polytechnic UniversityKowloonHong Kong