CCF Transactions on Networking, Volume 1, Issue 1–4, pp 1–15

Bridging machine learning and computer network research: a survey

  • Yang Cheng
  • Jinkun Geng
  • Yanshu Wang
  • Junfeng Li
  • Dan Li
  • Jianping Wu
Review Paper


With the rapid development of artificial intelligence (AI), a series of related applications are emerging and driving wide-ranging change across industry. As a core technology of AI, machine learning (ML) shows great potential for solving network challenges. Network optimization, in turn, brings significant performance gains to ML applications, in particular distributed machine learning. In this paper, we survey work that combines ML technologies with network research.


Artificial intelligence, Machine learning, Computer network, Directions and challenges

1 Introduction

In recent years, artificial intelligence (AI) has developed rapidly, especially with the breakthroughs in machine learning and deep learning. In the field of networking, it is generally acknowledged that applying ML techniques to practical problems is promising, and some pioneering works have been carried out to bridge networking and ML.

On one hand, ML technologies can be adopted to solve complicated challenges in network scenarios. Instead of relying on laborious manual operation, ML techniques can train a detection model that monitors the performance of a network system and identifies misconfigurations and malicious attacks efficiently and accurately. There is an emerging trend of using intelligent algorithms, such as Tabu search, deep learning and reinforcement learning, to efficiently find near-optimal solutions to NP-hard network resource management problems. For quality of service (QoS) optimization, ML technologies can support far more control policies and more complicated models than traditional algorithms, and so maintain higher performance of network services.

On the other hand, network optimization can also benefit the ML workflow and bring significant performance gains. As the amount of training data increases rapidly and ML models become more complicated, the computation requirement exceeds the capability of a single machine, and dozens of distributed machine learning platforms have emerged recently. However, expensive communication causes several bottlenecks for these platforms. Network optimizations, such as decentralized topologies, communication compression and network processing offload schemes, have improved the overall performance of these distributed ML platforms.

The combination of ML and networking proves to be an attractive direction with much left to explore. Because ML and networking form a fresh interdisciplinary area that currently lacks a systematic review of relevant works, this paper conducts a comprehensive survey and groups the existing works into three main dimensions: machine learning based network management solutions, machine learning based network security and privacy solutions, and network for distributed machine learning. Further, we focus on several typical network scenarios and bring forth the future directions and challenges in this interdisciplinary area.

The rest of the paper is organized as follows. Section 2 provides an overview of machine learning based network management solutions. Section 3 reviews the significant works in machine learning based network security and privacy solutions. Section 4 surveys representative works in network for distributed machine learning. We summarize and bring forth the future directions in Sect. 5.

2 Machine learning based network management solutions

Operation & management have always been a major part of network engineering, especially with the boom in cloud computing and the development of data centers. As networks have grown greatly in scale, it has become more challenging for network operators to manage such large networks and guarantee the service quality provided to customers. Traditional operating methodology requires a large amount of manual work, which places a heavy burden on network operators. The rise of ML techniques has brought new opportunities to free network operators from this workload. Meanwhile, through machine learning techniques, system performance can be optimized and resources can be better utilized. Specifically, ML techniques have been applied in the following aspects of network operation & management.

2.1 Intelligent maintenance and fault detection

Performance anomalies can damage the service quality provided to customers, and detecting or preventing anomalies in routine maintenance is a critical issue for network operators. Many explorative works have focused on the design of anomaly detectors Fontugne et al. (2010), Soule et al. (2005), Li et al. (2006), Yamada et al. (2013), Ashfaq et al. (2010). However, traditional monitoring mechanisms are time-consuming and less effective: they require expert knowledge from operators and involve laborious manual work. In response, some novel solutions have been proposed that utilize ML techniques for network system maintenance.

2.1.1 Maintenance with sufficient history data

Given sufficient history data, it is feasible to train a detection model that replaces human operators in monitoring the performance of a network system. Opprentice Liu et al. (2015) is a framework that detects performance anomalies with reference to KPI data. Its main objective is to choose suitable detectors and tune them to detect anomalies in real time without the participation of network operators. To attain this, Opprentice constructs a random forest model from the history data accumulated by experienced operators. Human operators can interact with the system periodically to set hyper-parameters and input fresh training data for correction. Most of the time, Opprentice conducts online detection of possible anomalies independently. In this way, the accumulated history data is put to use and human labor is much reduced. The Winnowing Algorithm Lutu et al. (2014) is another successful application, in router management, which identifies unintended limited visibility prefixes (LVPs) caused by misconfigurations or unforeseen routing policies. The essence of the Winnowing Algorithm is decision-tree based classification. In the training process, many decision trees are built from the labeled data. With the resulting boosted tree model, the Winnowing Algorithm distinguishes unintended LVPs from those that are the stable expression of intended routing policies, and thereby detects anomalous events in the routing system.

2.1.2 Maintenance without sufficient history data

On the other hand, when there is not enough data to train a suitable model for the practical scenario, transfer learning becomes a worthwhile choice. The SMART solution Botezatu et al. (2016), proposed by researchers at IBM, focuses on disk replacement in data centers and trains a classification model to determine whether a disk should be replaced. Since the data come from different scenarios with different distributions, the authors take advantage of transfer learning to eliminate sample selection bias Zadrozny (2004). In this way, data from different scenarios are transferred to train the ML model and help the operator with disk replacement.

2.2 Resource management and job scheduling

Resource management and job scheduling have always been hot topics in data centers, especially as the scale grows and communication becomes more complicated Chowdhury and Stoica (2015), Ma et al. (2016). The problem can usually be formulated as an NP-hard problem, and past non-AI works adopt simple heuristic methods to solve it Ballani et al. (2011), Li et al. (2016), Xie et al. (2012), Zhu et al. (2012). Nowadays there is an emerging trend of using intelligent algorithms to find better solutions efficiently. Generally speaking, a proper resource management solution addresses two kinds of demand, i.e. utilization and energy saving. On one hand, the solution is expected to improve resource utilization and accelerate job progress. On the other hand, energy consumption in the data center should be reduced, especially in dynamic scenarios that involve VM (or container) migration. Besides utilization-driven and energy-saving objectives, some works pursue a hybrid objective to reach a better trade-off among various performance metrics.

2.2.1 Utilization-driven solution

Libra Ma et al. (2016) is a representative work focusing on the first aspect; it aims to maximize the isolation guarantee by finding an optimal placement for parallel applications in the data center. To this end, Libra adopts Tabu search to find a good container placement at an affordable computation cost. Besides traditional intelligent algorithms such as Tabu search, newer machine learning techniques, including deep learning, are also applied in resource management and job scheduling. DeepRM Mao et al. (2016) is a pioneering work that adopts deep reinforcement learning to manage resources in network systems. It defines the average job slowdown to quantify the normalized progress rate of jobs and aims to minimize this metric with a standard policy gradient algorithm. DeepRM demonstrates the feasibility of deep RL techniques for resource management problems and motivates fresh ideas for subsequent research.
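As a rough illustration of the metric mentioned above, the following sketch computes an average job slowdown. It assumes that a job's slowdown is its completion time (finish minus arrival) divided by its ideal duration; the `Job` fields and the sample values are hypothetical and are not taken from DeepRM itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    arrival: float    # time the job entered the system
    finish: float     # time the job completed
    duration: float   # ideal (uncontended) running time, assumed > 0


def slowdown(job: Job) -> float:
    """Completion time divided by ideal duration; 1.0 means the job never waited."""
    return (job.finish - job.arrival) / job.duration


def average_slowdown(jobs: List[Job]) -> float:
    """Mean slowdown over a non-empty list of jobs (the quantity to minimize)."""
    return sum(slowdown(j) for j in jobs) / len(jobs)


# Hypothetical trace: the second job waits behind the first one.
jobs = [Job(0, 4, 4), Job(1, 7, 2), Job(2, 5, 3)]
print(average_slowdown(jobs))
```

Normalizing by the ideal duration keeps short jobs from being drowned out by long ones, which is presumably why a slowdown-style metric is preferred over raw completion time.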

2.2.2 Energy-saving solution

Compared with Libra and DeepRM, MadVM Zheng et al. (2013) puts its emphasis on the second aspect and aims to reduce energy consumption during VM management. The main idea of MadVM is to approximate the practical scenario of VM migration with a Markov Decision Process (MDP). Under the MDP framework, the objective is quantified with reference to both power consumption and resource shortage. Some approximation techniques are integrated to cope with the high dimensionality, so that the MDP provides a near-optimal, energy-efficient solution.
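To make the MDP formulation more concrete, here is a minimal sketch of value iteration on a toy VM placement problem. It is not MadVM's model: the two demand levels, two host types, transition probabilities and all cost constants are hypothetical, chosen only to show a per-step cost that combines power consumption with a resource shortage penalty.

```python
import itertools

# All names and numbers below are illustrative assumptions.
DEMANDS = ("low", "high")
HOSTS = ("small", "big")
P_DEMAND = {"low": {"low": 0.8, "high": 0.2},
            "high": {"low": 0.3, "high": 0.7}}
POWER = {"small": 1.0, "big": 3.0}
SHORTAGE_PENALTY = 5.0   # paid when a high-demand VM sits on the small host
MIGRATION_COST = 0.5     # paid whenever the VM changes host
GAMMA = 0.9              # discount factor


def step_cost(demand, host, target):
    """Power of the chosen host, plus shortage and migration costs if they apply."""
    cost = POWER[target]
    if demand == "high" and target == "small":
        cost += SHORTAGE_PENALTY
    if target != host:
        cost += MIGRATION_COST
    return cost


def value_iteration(tol=1e-9):
    """Return (value function, greedy policy) for the toy cost-minimizing MDP."""
    states = list(itertools.product(DEMANDS, HOSTS))
    values = {s: 0.0 for s in states}
    while True:
        new_values, policy = {}, {}
        for demand, host in states:
            q = {}
            for target in HOSTS:
                future = sum(p * values[(d2, target)]
                             for d2, p in P_DEMAND[demand].items())
                q[target] = step_cost(demand, host, target) + GAMMA * future
            best = min(q, key=q.get)
            policy[(demand, host)] = best
            new_values[(demand, host)] = q[best]
        if max(abs(new_values[s] - values[s]) for s in states) < tol:
            return new_values, policy
        values = new_values


values, policy = value_iteration()
print(policy)
```

With these constants the shortage penalty outweighs the extra power of the big host, so the resulting policy moves the VM to the big host under high demand and back to the small host under low demand. A realistic state space is far larger, which is where approximation becomes necessary.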

2.2.3 Hybrid objective based solution

Instead of focusing on one objective, some solutions cover more aspects of management. SmartYarn Xu et al. (2016) starts from the observation that different resource configurations may lead to similar performance, so an optimal multi-resource configuration is expected to provide the desired service quality while saving cost. To attain this, SmartYarn adopts reinforcement learning with consideration of both the service-level agreement and cost efficiency. Based on the usage-based pay-for-resources model, cost efficiency is quantified subject to the constraints of the performance requirement. SmartYarn then adopts the popular reinforcement learning algorithm Q-learning to solve the problem and efficiently reach an approximately optimal solution. A similar work also uses Q-learning to balance QoS revenue and power consumption in geo-distributed data centers Zhou et al. (2016). Its objective function quantifies QoS revenue and power consumption with a simple weighted sum, and optimization techniques are integrated to accelerate the computation. By tuning the hyperparameter in the weighted sum, adaptive policies can be generated from the RL model to cater to various services in geo-distributed data centers.
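The weighted-sum idea can be sketched with tabular Q-learning on a toy problem. Everything here is an assumption made for illustration: the cyclic load pattern, the reward definition (served demand as a stand-in for QoS revenue, one power unit per active server) and the hyperparameters are not taken from SmartYarn or Zhou et al. (2016).

```python
import random

LOADS = (1, 2, 3)      # hypothetical demand levels, visited cyclically
ACTIONS = (1, 2, 3)    # number of active servers


def reward(load, servers, w):
    """Weighted sum: w * QoS revenue - (1 - w) * power consumption."""
    qos_revenue = min(load, servers)   # served demand
    power = servers                    # one unit of power per active server
    return w * qos_revenue - (1.0 - w) * power


def q_learning(w, steps=20000, alpha=0.05, gamma=0.5, epsilon=0.2, seed=0):
    """Learn a load -> server-count policy for a given weight w."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in LOADS for a in ACTIONS}
    state = LOADS[0]
    for _ in range(steps):
        if rng.random() < epsilon:                       # explore
            action = rng.choice(ACTIONS)
        else:                                            # exploit
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        next_state = LOADS[(LOADS.index(state) + 1) % len(LOADS)]
        target = reward(state, action, w) + gamma * max(
            q[(next_state, a)] for a in ACTIONS)
        q[(state, action)] += alpha * (target - q[(state, action)])
        state = next_state
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in LOADS}


print(q_learning(w=0.7))   # revenue-heavy weight
print(q_learning(w=0.1))   # power-heavy weight
```

Changing only `w` changes the learned policy: a revenue-heavy weight provisions servers to match demand, while a power-heavy weight keeps the server count at its minimum. This is the sense in which tuning the weight yields different adaptive policies.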

Moreover, some research works focus on predicting future conditions in order to determine the current resource management solution. For example, user demand can be modeled with neural network classifiers, and adaptive solutions are then generated to determine the resource configuration and job scheduling in the data center Bao et al. (2016). Recent works, such as Tan et al. (2017), Zhang et al. (2017), He et al. (2013), adopt online learning techniques to predict future workload and reduce the cost in time and resources.

2.3 Service performance optimization

AI&ML can be effective tools to optimize the performance of various applications for better network services, such as video streaming services, web searching services, content delivery services and so on.

2.3.1 Video service optimization

Video quality can be affected by many factors during delivery Jiang et al. (2016), and many prior works have targeted a better user experience. Since network quality fluctuates, adaptive bitrate (ABR) algorithms receive much attention because they aim to select the proper bitrate based on changing network conditions. CS2P Sun et al. (2016) improves bitrate adaptation based on network throughput. Inspired by the similarity of throughput patterns among different video sessions, CS2P establishes a Hidden Markov Model (HMM) to predict the throughput and then make adaptation decisions. The HMM proves to be effective in throughput prediction and contributes to a better quality of experience (QoE). However, most ABR algorithms can hardly adapt to a broad range of network conditions and objectives because of their fixed control rules and simplified models. Therefore, Pensieve Mao et al. (2017) resorts to reinforcement learning for bitrate adaptation. Instead of fixed control policies, Pensieve establishes a neural network and learns the bitrate control policy automatically. In this way, Pensieve outperforms state-of-the-art ABR algorithms under a variety of network conditions and QoE objectives.
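The HMM-based prediction step can be illustrated with a small forward filter. This is a generic sketch rather than CS2P's actual model: the two hidden states, the transition matrix, the Gaussian emission parameters and the bitrate ladder are all hypothetical values chosen for the example.

```python
import math

# Hypothetical two-state model of network conditions (throughput in Mbps).
STATES = ("congested", "clear")
TRANS = {"congested": {"congested": 0.9, "clear": 0.1},
         "clear": {"congested": 0.2, "clear": 0.8}}
MEAN = {"congested": 1.0, "clear": 5.0}
STD = {"congested": 0.5, "clear": 1.0}


def gaussian(x, mu, sigma):
    """Gaussian density, used as the emission likelihood of a throughput sample."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def predict_throughput(observations):
    """Forward-filter the observed throughputs and predict the next one."""
    belief = {s: 1.0 / len(STATES) for s in STATES}   # uniform prior
    for obs in observations:
        # Correct the belief with the emission likelihood of the observation.
        belief = {s: belief[s] * gaussian(obs, MEAN[s], STD[s]) for s in STATES}
        total = sum(belief.values())
        belief = {s: b / total for s, b in belief.items()}
        # Propagate the belief one step ahead through the transition matrix.
        belief = {s2: sum(belief[s1] * TRANS[s1][s2] for s1 in STATES)
                  for s2 in STATES}
    return sum(belief[s] * MEAN[s] for s in STATES)


def pick_bitrate(predicted, ladder=(0.5, 1.5, 3.0, 4.5)):
    """Choose the highest bitrate not exceeding the prediction (lowest as fallback)."""
    feasible = [b for b in ladder if b <= predicted]
    return max(feasible) if feasible else min(ladder)


prediction = predict_throughput([4.8, 5.2, 5.1])
print(prediction, pick_bitrate(prediction))
```

A fixed rule such as `pick_bitrate` is exactly the kind of hand-written control policy that a learned approach replaces; it is kept here only to show how a throughput prediction feeds a bitrate decision.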

2.3.2 Web search optimization

Apart from the quality management of video streaming, web search is another hot field in QoS optimization. From the perspective of user experience, response time is a key attribute to consider when optimizing QoS for web services. Diagnosing and reducing high search response time (HSRT) is challenging under a heavy workload, which motivates the combination of machine learning techniques and HSRT optimization. FOCUS Liu et al. (2016) is the first work in this direction; it adopts a decision tree model to learn domain knowledge automatically from search logs and identify the major factors responsible for HSRT conditions.
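One common way a decision tree singles out influential factors is by ranking attributes by information gain. The sketch below assumes that criterion and uses made-up log records; the attribute names (`has_images`, `browser`) and the `slow` label are hypothetical and do not come from FOCUS.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, label="slow"):
    """Reduction in label entropy obtained by splitting rows on one attribute."""
    base = entropy([r[label] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[label] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical search-log records; "slow" marks a high response time.
logs = [
    {"has_images": True,  "browser": "A", "slow": True},
    {"has_images": True,  "browser": "B", "slow": True},
    {"has_images": False, "browser": "A", "slow": False},
    {"has_images": False, "browser": "B", "slow": False},
]

ranking = sorted(("has_images", "browser"),
                 key=lambda a: information_gain(logs, a), reverse=True)
print(ranking)
```

In this toy log, `has_images` separates slow from fast queries perfectly while `browser` carries no information, so the former would be chosen as the root split and reported as the dominant factor.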

2.3.3 CDN service optimization

PACL (Privacy-Aware Contextual Localizer) Das et al. (2014) adopts a technique similar to that of FOCUS, but it aims to learn users' contextual locations and thereby improve the quality of content delivery. PACL models mobile traffic with a decision tree, together with pruning techniques. The most significant attributes are then identified to infer user location contexts. With the predicted context information, PACL is able to choose the nearest CDN node for content delivery, thus reducing the waiting time in data transfer.

2.3.4 Congestion control optimization

Machine learning can also be integrated into the TCP congestion control (CC) mechanism to improve network performance. For example, it has been used to classify congestive and non-congestive loss Jayaraj et al. (2008), forecast TCP throughput Mirza et al. (2007), and improve RTT estimation Nunes et al. (2011). Remy Winstein et al. (2013) formalizes the multi-user congestion control problem as an MDP and learns the optimal policy offline. It needs intense offline computation, and the performance of the resulting RemyCCs depends on the accuracy of the network and traffic models. PCC Dong et al. (2015) adaptively adjusts its sending rate based on continuous profiling, but it is entirely rate-based and its performance depends on the accuracy of the clocking. The learnability of TCP CC was examined in Sivaraman et al. (2014), where RemyCC was used to understand which kinds of imperfect knowledge about the network model hurt the learnability of TCP CC more than others. Q-learning based TCP Li et al. (2016) is the first attempt (that we know of) to use Q-learning to design TCP CC.

2.4 Traffic analysis

Traditional clustering and classification methods are widely applied in earlier works that aim to find valuable information in large volumes of data packets Santiago et al. (2012), Baralis et al. (2013), Franc et al. (2015), Bartos et al. (2016), Xu et al. (2015), Antonakakis et al. (2012). Via clustering and classification, similar patterns are mined from data packets, which is helpful for applications such as security analysis and user profiling. The combination of classification and traffic analysis has remained a hot topic in recent years; nevertheless, other machine learning algorithms have also come into use for traffic analysis.

2.4.1 Natural language processing for traffic analysis

Proword Zhang et al. (2014) leverages natural language processing techniques in protocol analysis. First, Proword uses the Voting Experts (VE) algorithm to select the most probable boundary positions for word partitioning. Then, based on the candidate feature words extracted by the VE algorithm, Proword mines the protocol features: the candidate words are ranked with pre-defined score rules and the top k of them serve as the feature words. The combination of NLP and protocol analysis demonstrates higher accuracy than traditional protocol analysis methods.

2.4.2 Exploratory factor analysis for traffic analysis

Exploratory factor analysis (EFA) emerges as a novel factor analysis technique, and it is believed to be more effective than traditional principal component analysis (PCA) for multivariate analysis. The recent work Furno et al. (2017) applies EFA to mobile traffic data and bridges temporal and spatial structure analysis. This work fills the gap in this joint area with results better than or equal to those of state-of-the-art solutions.

2.4.3 Transfer learning for traffic analysis

Transfer learning also contributes to the traffic analysis and security threat detection in Bartos et al. (2016), which proposes a classification system to detect both known and previously unseen security threats. Since there may be biases between the training domain and target domain, the knowledge acquired from traditional training model cannot be directly applied to the target cases. Through transfer learning technique, the feature values are transformed into an invariant representation for domain adaptation. Such an invariant representation is integrated into the classification system, which helps to categorize traffic into malicious or legitimate classes.

Traffic analysis is perhaps the area most closely related to machine learning techniques, and it has attracted much effort from computer network researchers. Its broad applications lie in security research, such as user profiling and anomaly detection. We further discuss these aspects in the following section.

3 Machine learning based network security and privacy solutions

Network security and privacy form a broad area covering a range of issues, including anomaly detection, user authentication, attack defense, privacy protection and other aspects. With the explosion of network data in recent years, traditional methodologies are confronted with more challenges in detecting and defending against emerging attacks. Inspired by the success of ML in traditional areas, researchers try to use classic ML methodology (e.g. Naive Bayes Nandi et al. (2016), KNN Wang et al. (2014), regression Nandi et al. (2016), decision tree Zheng et al. (2014), Soska and Christin (2014), SVM Franc et al. (2015), random forest Hayes and Danezis (2016)) to solve a series of complicated security problems.

3.1 Anomaly detection on cloud platform

Anomalies such as misconfigurations and attacks on vulnerabilities have a great impact on system security and stability, and detecting them can be very challenging without sufficient knowledge of the system conditions. In most cases, detection requires analyzing a large amount of system logs, which is not feasible for human operators. Machine learning techniques thus provide an alternative to manual work.

As cloud platforms become more and more popular, many attackers seek to steal their rich resources, and how to detect anomalous behavior on a cloud platform is attracting wide concern among researchers. FraudStormML Neuvirth et al. (2015) adopts supervised regression methods to detect fraudulent use of cloud resources in Microsoft Azure. It detects fraud storms with reference to their utilization of resources such as bandwidth and computation units, thus raising early alerts to prevent the illegal consumption of cloud computing resources. APFC\(^3\) Zhang et al. (2014) is another system, developed with a hierarchical clustering algorithm, that automatically profiles the maximum capacities of various cross-VM covert channels on different cloud platforms. Covert channels may cause information leakage and be utilized by hackers to threaten system security.

The maturity of NLP techniques brings new ideas to anomaly detection, and analyzing execution logs proves to be an efficient way to detect anomalous issues. Nandi et al. (2016) follow this idea and construct a graph abstraction from the execution logs: the executions in distributed applications are abstracted as nodes, and workflows are abstracted as edges that connect them. Naive Bayes and linear regression models are used to detect whether any anomaly is hidden in the graph. Such a mapping from anomaly detection to graph mining achieves good efficiency and scalability.

In most of these works Neuvirth et al. (2015), Zhang et al. (2014), classic algorithms are trained on system logs and other KPIs, so the key to more precise results is finding a well-organized set of features, and such approaches appear to generalize poorly. More precise results and better generalization can be gained from another view, as in Nandi et al. (2016).

3.2 Authentication and privacy protection

Authentication guarantees user privacy and prevents information leakage. The user identity is supposed to be authenticated with a reliable mechanism, and then the user is granted authorization for access.

The popularity of mobile devices causes more problems for user authentication and privacy protection. On one hand, numerous applications on the phone collect users' behavior data to provide better service, but the collected information may be exploited by adversaries, leading to leakage and threatening user privacy. On the other hand, mobile users may be unaware of the need for information protection, so their credentials may easily be stolen by others. Complex passwords and/or secondary verification mechanisms can reduce the risk of authentication attacks, but they also bring significant inconvenience to users.

Observing that different users behave differently, Zheng et al. (2014) designed a novel authentication mechanism that improves privacy by profiling a user's behavior according to their habits while using a smartphone. More specifically, they compute a nearest-neighbor distance and a decision score based on features (e.g. acceleration, pressure, size and time) to judge whether the behaviors belong to the same user and whether authentication should be granted.
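A nearest-neighbor check of this kind can be sketched in a few lines. The feature vectors, their normalization to a common scale and the acceptance threshold are all assumptions made for illustration; the decision score used by Zheng et al. (2014) is not reproduced here.

```python
import math


def nearest_neighbor_distance(sample, enrolled):
    """Euclidean distance from a new sample to the closest enrolled sample."""
    return min(math.dist(sample, ref) for ref in enrolled)


def authenticate(sample, enrolled, threshold):
    """Accept the sample if it lies close enough to the owner's enrolled behavior."""
    return nearest_neighbor_distance(sample, enrolled) <= threshold


# Hypothetical enrolled tap samples: (acceleration, pressure, size, time),
# assumed to be normalized to comparable ranges beforehand.
enrolled = [(0.20, 0.50, 0.30, 0.10), (0.25, 0.55, 0.28, 0.12)]

print(authenticate((0.22, 0.52, 0.30, 0.11), enrolled, threshold=0.1))
print(authenticate((0.90, 0.10, 0.80, 0.60), enrolled, threshold=0.1))
```

The threshold trades false rejections of the owner against false acceptances of an impostor, so in practice it would have to be tuned per user from enrollment data.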

Besides analysis on the user side, analysis on the adversaries’ side is also worthwhile. It has been observed that the behaviors of users and adversaries keep changing dynamically Wang and Zhang (2014). The changes in the user’s behavior and the adversary’s actions can be modeled with a two-state Markov chain for inferring the user’s behavior. The interaction between user and adversary is generalized as a zero-sum game: the user tries to change their behavior so that the adversary fails to predict their next state, whereas the adversary tries to predict the user’s behavior by adjusting its strategy. The zero-sum game can be solved by a minimax learning algorithm with provable convergence, yielding the optimal strategies for users to protect their privacy against malicious adversaries.
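The minimax idea can be illustrated on the smallest possible case. The sketch below is not the learning algorithm of Wang and Zhang (2014); it swaps in the closed-form solution of a 2x2 zero-sum game, with a hypothetical loss matrix in which the user (row player) loses when the adversary (column player) guesses the user's state correctly.

```python
def solve_2x2_zero_sum(loss):
    """Solve a 2x2 zero-sum game for the row player, who minimizes the loss.

    loss[i][j] is the user's loss when the user picks row i and the adversary
    picks column j. Returns (probability of playing row 0, value of the game).
    """
    (a, b), (c, d) = loss
    upper = min(max(a, b), max(c, d))   # best worst-case loss over pure rows
    lower = max(min(a, c), min(b, d))   # adversary's guaranteed loss over pure columns
    if upper == lower:                  # saddle point: a pure strategy is optimal
        row = 0 if max(a, b) <= max(c, d) else 1
        return (1.0 if row == 0 else 0.0), float(upper)
    # Otherwise mix so that the adversary is indifferent between its two columns.
    denom = a - b - c + d
    p = (d - c) / denom
    value = (a * d - b * c) / denom
    return p, value


# Hypothetical losses: a correct guess of state 0 costs 3, of state 1 costs 1.
print(solve_2x2_zero_sum([[3, 0], [0, 1]]))
```

In the mixed case the user randomizes so that neither guess is more profitable for the adversary, which is the unpredictability the zero-sum formulation is after. A learning algorithm is needed when the payoffs are unknown or the game is larger than this toy case.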

Fingerprinting is regarded as an effective method for behavior identification even under encryption, and it can be utilized by both defenders and attackers. AppScanner Taylor et al. (2016) is a typical work that uses automatic fingerprinting to identify Android apps based on analysis of the network traffic they generate. It adopts an SVM classifier and a random forest to establish its identification model. By using flow characteristics such as packet length and traffic direction, AppScanner is able to identify sensitive apps that may cause attacks. Since AppScanner focuses on statistical characteristics, it works well against encryption. At the same time, fingerprinting attacks are seen as a serious threat to online privacy.
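The kind of statistical flow features described above can be sketched as follows. The encoding of a flow as signed packet lengths (positive for outgoing, negative for incoming) and the particular statistics are assumptions made for this example, not AppScanner's actual feature set.

```python
import statistics


def flow_features(signed_lengths):
    """Summarize a flow given as signed packet lengths (+ outgoing, - incoming).

    Only packet sizes and directions are used, never payload bytes, which is
    why features of this kind remain available when traffic is encrypted.
    """
    outgoing = [n for n in signed_lengths if n > 0]
    incoming = [-n for n in signed_lengths if n < 0]
    features = {}
    for name, packets in (("out", outgoing), ("in", incoming)):
        features[f"{name}_count"] = len(packets)
        features[f"{name}_min"] = min(packets, default=0)
        features[f"{name}_max"] = max(packets, default=0)
        features[f"{name}_mean"] = statistics.fmean(packets) if packets else 0.0
        features[f"{name}_std"] = statistics.pstdev(packets) if packets else 0.0
    return features


# Hypothetical flow: two small requests and three larger responses.
flow = [120, -1500, -1500, 80, -400]
print(flow_features(flow))
```

A feature dictionary like this would then be turned into a fixed-length vector and passed to a classifier such as an SVM or a random forest.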

Privacy is a big concern given the popularity of mobile devices. More knowledge of user behavior helps the service provider offer a better user experience, but it can also be exploited by malicious attackers to put the owner in danger. The situation therefore resembles an adversarial game, and many of the works mentioned above bring novel ideas for handling it. In general, how to identify malicious behaviors while maintaining a good user experience should be taken into consideration.

3.3 Web security and attack detection

Web security is another serious issue today, and many websites suffer attacks caused by vulnerabilities and other factors. The recent k-fingerprinting attack Hayes and Danezis (2016) employs website fingerprinting to launch an attack even when confronted with a large amount of noisy data and encrypted traffic. It adopts a random decision forest to construct the website fingerprints and trains it on a set of features (such as bursts or packet lengths instead of plain text). It proves to be an efficient method for identifying which websites the victim is visiting based on history data, which can then be used in further attacks.

How to identify malicious websites is attracting great interest from academia. In general, traffic inspection serves as a common and effective method for identifying malicious behavior. The recent work Soska and Christin (2014) proposes a general approach for predicting a website's propensity to become malicious in the future. It adopts a C4.5 decision tree trained on relevant features (e.g. derived from traffic statistics, the file system, and webpage structure and contents) to identify whether a page is going to become malicious.

However, the increasing variety of network applications and protocols, as well as the wide use of encryption techniques, makes it more challenging to identify malicious behavior in the huge mass of traffic. Meanwhile, many relevant works have brought novel ideas on how to train models with limited datasets and make ML more powerful for web security issues.

The recent work Franc et al. (2015) focuses on how to learn a traffic detector from weak labels. It adopts a novel Neyman-Pearson model to identify malicious domains by learning from a publicly available blacklist of malicious domains. Unlike traditional methods, it can learn efficiently from a weakly labeled dataset within a multiple instance learning (MIL) framework. The main idea is to first extract features from the proxy log and then analyze the correlation between a huge amount of unlabeled data and a small fraction of labeled data, deriving weak labels for the former. This gives inspiration on how to learn from weakly labeled datasets.

Apart from the shortage of labeled data for training, another problem in malicious behavior detection lies in the difficulty of understanding the data representation, since attackers may hide their behaviors with traffic obfuscation to escape being tracked. Confronted with this problem, Bartos et al. (2016) propose a robust representation for classifying evolving malicious behaviors from obfuscated traffic. It groups sets of network flows into bags and represents them with a combination of feature values and feature differences. The representation is designed to be resilient to feature shifting and scaling and oblivious to bag permutation and size changes. The proposed optimization method learns the parameters of the representation automatically from the training data (an SVM-based procedure learns the number and size of the histogram bins), allowing the classifiers to create robust models of malicious behaviors that are capable of detecting previously unseen malicious variants and behavior changes.
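The invariance properties listed above can be illustrated with a deliberately simplified construction. This sketch is not the representation of Bartos et al. (2016): it assumes a bag holding one numeric feature per flow, uses pairwise differences for shift invariance, divides by the largest difference for scale invariance, and builds a relative-frequency histogram (with a fixed, hand-picked number of bins rather than learned ones) for permutation and size invariance.

```python
def invariant_representation(values, bins=4):
    """Map a non-empty bag of feature values to a normalized histogram.

    The result is unchanged if the bag is shifted, positively scaled,
    reordered, or repeated as a whole.
    """
    diffs = [abs(x - y) for x in values for y in values]   # removes any shift
    largest = max(diffs)
    if largest > 0:
        diffs = [d / largest for d in diffs]               # removes any scaling
    counts = [0] * bins
    for d in diffs:
        counts[min(int(d * bins), bins - 1)] += 1          # order no longer matters
    return [c / len(diffs) for c in counts]                # bag size no longer matters


bag = [1, 2, 4]
print(invariant_representation(bag))
print(invariant_representation([3 * v + 10 for v in reversed(bag)]))
```

Both calls print the same histogram, because the second bag is just the first one reordered, scaled by 3 and shifted by 10. An attacker who pads or rescales a feature therefore does not change what the classifier sees.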

Unlike malicious obfuscation, traffic encryption is usually applied for privacy protection, but it also makes malicious flows much harder to identify. To analyze malicious traffic flows with encrypted payloads, Comar et al. (2013) use meta-information, such as packet length and time interval, to train the model. A two-level framework is constructed that combines an existing IDS with a self-developed SVM algorithm, and it proves to be effective in identifying malicious traffic among tremendous amounts of flow data.

Limited by obfuscation and encryption, the methods mentioned above may not work as well as raw traffic analysis, and more advanced methodologies may be proposed in the future to cater to those scenarios. Meanwhile, many works try to work around these limitations from other angles.

SpiderWeb Stringhini et al. (2013) offers a new approach to detecting malicious web pages by using redirection graphs. It collects HTTP redirection data from a large and diverse collection of web users and aggregates the different redirection chains that lead to a specific web page. It then analyzes the characteristics of the resulting redirection graph and extracts 28 features to represent it. By feeding these features into an SVM classifier, SpiderWeb is able to identify malicious web pages more accurately than previous methods.
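The aggregation step can be sketched as follows. The four features computed here are hypothetical examples of graph characteristics and are not SpiderWeb's 28 features; the host names are placeholders.

```python
def redirection_graph_features(chains):
    """Aggregate redirection chains ending at one page into simple graph features.

    Each chain is a list of URLs or hosts in the order a browser visited them,
    ending at the page under analysis. At least one chain is assumed.
    """
    nodes, edges, entry_points = set(), set(), set()
    for chain in chains:
        nodes.update(chain)
        edges.update(zip(chain, chain[1:]))   # consecutive hops become directed edges
        entry_points.add(chain[0])
    return {
        "nodes": len(nodes),
        "edges": len(edges),
        "entry_points": len(entry_points),
        "max_chain_length": max(len(c) for c in chains),
    }


# Hypothetical chains observed from different users, all ending at the same page.
chains = [
    ["a.example", "r.example", "final.example"],
    ["b.example", "r.example", "final.example"],
    ["a.example", "final.example"],
]
print(redirection_graph_features(chains))
```

Merging chains from many users is what turns individual redirect sequences into a graph whose shape can be compared across pages.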

Going further, MEERKAT Borgolte et al. (2015) brings a novel approach based on the “look and feel” of a website to identify whether the website has been defaced. Different from previous works, MEERKAT leverages recent computer vision techniques and directly takes a snapshot of the website as input. It automatically learns high-level features directly from the data and does not rely on additional information supplied by the website’s operator. MEERKAT employs a stacked autoencoder neural network to “feel” the high-level features. The features are extracted automatically and fed into a feedforward neural network to identify defaced websites.

Web security is a demanding problem and is hard to solve with ML-based methodology in the face of obfuscation and encryption technologies. On the one hand, researchers can develop more advanced models that cater to such harsh conditions; on the other hand, novel angles from other areas may bring new opportunities.

3.4 Barriers in ML-based security research

Researchers have focused on applying classic ML methodology to security issues and have achieved great success. However, barriers still remain in this area, and here we summarize three main aspects as follows.

Challenge of model design. Network security issues are mainly analyzed on the basis of traffic traces and system logs. Current research simply borrows models from other areas (such as computer vision and pattern recognition) and applies them to the security scenario. However, how to design more effective models that mine the complex relationships hidden in these data remains a key concern. Wang (2015) brings a novel idea on how to map network traces to other areas: inspired by CV, it maps traffic to a bitmap and adopts an autoencoder to distinguish different kinds of traffic. It works well on unencrypted raw network traffic but is not suitable for encrypted traffic. Other work, such as Nandi et al. (2016), which maps system logs to DAG problems, also gives inspiration on how to map network security issues to traditional areas. How to design effective models tailored to network security scenarios deserves deeper exploration in the future.

Lack of training dataset In traditional areas (e.g. CV, NLP and speech recognition), many datasets (e.g. ImageNet, MNIST, SQuAD, Billion Words, bAbi, TED-LIUM, LibriSpeech) are public to academia and industry, so researchers can access those resources and develop more advanced models. In the security area, many factors (such as political and commercial concerns) constrain access to ground-truth network datasets, making it even harder to apply machine learning technologies to network security issues.

Adversarial and game theory The security problem is more like a competitive game with many factors involved. On the one hand, defenders try to get more precise results by tuning parameters and adopting advanced models; on the other hand, attackers try to obfuscate their malicious behavior so that it resembles normal data. This adversarial relationship between defender and attacker makes ML-based security issues more complicated, since ML techniques can be leveraged by both attackers and defenders.

In summary, ML has considerable potential in security research. However, more factors need to be considered, and many open problems remain in this area.

4 Network for distributed machine learning platform

The rapid development of the Internet has led to an explosion in the amount of business data as well as increasingly complex training models. The time-consuming training process and heavy workload make it impractical to undertake these tasks on a single machine; therefore, distributed computing becomes an alternative worth considering. In recent years, there have been some representative works on distributed machine learning platforms, such as Hadoop hadoop (2009), Spark Zaharia et al. (2010), GraphLab Low et al. (2014), DistBelief Dean et al. (2012), Tensorflow Abadi et al. (2016), MXNet Chen et al. (2015), etc. Generally speaking, there are several major issues to address when constructing an efficient distributed machine learning platform: (1) network topology, (2) parallelism and synchronization, and (3) communication and scalability.

4.1 Network topology for distributed machine learning

The architecture design of the distributed platform can significantly affect the execution efficiency and overall performance; meanwhile, it is closely related to other issues, such as fault tolerance and scalability. So far, two types of architectural prototypes have been proposed, i.e. Parameter Server-based (PS-based) architecture and Ring-based architecture.

4.1.1 PS-based architecture

Fig. 1

Parameter server architecture

PS-based architecture Chilimbi et al. (2014) is illustrated in Fig. 1. In the PS-based design, the machines (or nodes, see footnote 1) are organized in a centralized way and there is a functional difference among them. Some nodes work as parameter servers (PSs) whereas the others work as workers. PSs are responsible for managing the parameters of the model and coordinating the execution of workers. Workers undertake the training tasks and submit their updated parameters to PSs periodically.

Take the gradient descent (GD) algorithm as an example, which is a common iterative method adopted for training neural networks. Under the PS-based architecture, each worker holds a part of the training data. In the beginning, workers pull the model parameters from the PSs, and each worker then trains its model independently with its local data. After a certain number of computation iterations, workers push their newly calculated gradients to the PSs, which aggregate these gradients to update the whole model. Workers then pull the updated model again and continue their training.
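The pull, compute, push cycle described above can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: the one-parameter model y = w * x, the class name `ParameterServer`, the learning rate, and the choice to push after every iteration (rather than after several) are all hypothetical simplifications, not the design of any particular platform.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair for the toy model y = w * x


def local_gradient(w: float, shard: List[Sample]) -> float:
    """Gradient of the mean squared error on one worker's data shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


class ParameterServer:
    """Holds the global parameter and aggregates pushed gradients."""

    def __init__(self, w: float, lr: float) -> None:
        self.w = w
        self.lr = lr

    def pull(self) -> float:
        """Workers call this to fetch the current global parameter."""
        return self.w

    def push(self, grads: List[float]) -> None:
        """Average the workers' gradients and apply one descent step."""
        self.w -= self.lr * sum(grads) / len(grads)


def train(shards: List[List[Sample]], rounds: int = 200, lr: float = 0.05) -> float:
    """Run synchronous PS-style training; each shard plays one worker."""
    ps = ParameterServer(w=0.0, lr=lr)
    for _ in range(rounds):
        # Every worker pulls the model and computes on its own shard only.
        grads = [local_gradient(ps.pull(), shard) for shard in shards]
        # The server aggregates the pushed gradients into the global model.
        ps.push(grads)
    return ps.pull()
```

For example, splitting the six samples `(x, 3x)` for `x = 1..6` into three shards of two samples and calling `train(shards)` drives the shared parameter towards the true slope of 3.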

4.1.2 Ring-based architecture

Unlike PS-based architecture, which follows a centralized principle and keeps functional differences among nodes, Ring-based ringspshpc (2017) architecture treats all nodes as logically equal and works in a completely decentralized way.
Fig. 2

Ring-based architecture

As illustrated in Fig. 2, the nodes are organized as a ring and each node in the ring works in the same way. Compared with PS-based architecture, a node in Ring-based architecture does not need to push/pull its parameters to/from a central server; instead, it communicates only with its two directly connected neighbors. Under the framework of Ring-based architecture, each node works iteratively and the execution can be decomposed into two steps, scatter and gather.

For simplicity, we use a 6-node ring to illustrate the execution of scatter and gather (Fig.  3). Similar to the workers in PS-based architecture, each node in the ring holds a part of the training data. During each iteration, the 6 nodes in the ring run independently and compute the parameters for the training model. Then they will execute scatter and gather procedure to synchronize the parameters. After these steps, each of them will gain the same parameters and continue to execute the next iteration.

Scatter As shown in Fig. 2, each of the six workers holds one copy of the training model and computes the model parameters in parallel. After every node has computed its local parameters, the scatter procedure is triggered for each node to exchange its parameters with its neighbors. Each node evenly splits its local parameters into several shares, and the number of shares equals the number of nodes. In this scenario, each node divides its local parameters into 6 parts and transfers one part in each scatter step.
Fig. 3

Steps in scatter stage

In this case, it follows that after 5 scatter steps, each node will possess one part of the global parameters, which is the sum of the corresponding parts of the original parameters from every node. The scatter procedure then completes, and the gather procedure follows to synchronize the parameters across nodes.

Gather Since each node possesses only one part of the global parameters (1/6 of the parameters for each node in this scenario) after the scatter procedure, the gather procedure tries to synchronize the parameters. Similar to scatter, each node sends its global parameters to its right neighbour and receives global parameters from its left neighbour. As shown in Fig. 4, during the first gather, \(node_{0}\) receives the global parameters from \(node_{5}\) and passes its own global parameters to \(node_{1}\). During the second gather, \(node_{0}\) receives the global parameters of \(node_{4}\) indirectly via \(node_{5}\); meanwhile, \(node_{0}\) also passes to \(node_{1}\) the fresh global parameters that it received from \(node_{5}\) in the last gather.
Fig. 4

Steps in gather stage

Similar to the scatter procedure, after 5 gathers, the parameters on each node will be synchronized and each node will hold the same global parameters for the next iteration.
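The scatter and gather procedures can be simulated in a few lines. The following is a minimal single-process sketch, assuming that "global parameters" means the element-wise sum of the nodes' local parameters and that node i forwards share (i - s) mod n in scatter step s; the function name `ring_allreduce` and this particular share schedule are illustrative choices, not taken from the surveyed systems.

```python
from typing import List


def ring_allreduce(local: List[List[float]]) -> List[List[float]]:
    """Simulate scatter (with summation) followed by gather on a ring.

    local[i] is node i's parameter vector. Its length must be a multiple
    of the number of nodes so that it splits into equal shares.
    """
    n = len(local)
    size = len(local[0]) // n
    # chunks[i][c] is node i's current copy of share c.
    chunks = [[vec[c * size:(c + 1) * size] for c in range(n)] for vec in local]

    # Scatter: n - 1 steps. In step s, node i sends share (i - s) mod n to
    # its right neighbour, which adds it to its own copy of that share.
    for s in range(n - 1):
        sent = [(i, (i - s) % n, chunks[i][(i - s) % n]) for i in range(n)]
        for i, c, payload in sent:
            right = (i + 1) % n
            chunks[right][c] = [a + b for a, b in zip(chunks[right][c], payload)]

    # Node i now holds the fully summed share (i + 1) mod n.
    # Gather: n - 1 steps. In step s, node i forwards share (i + 1 - s) mod n
    # and the right neighbour overwrites its stale copy with it.
    for s in range(n - 1):
        sent = [(i, (i + 1 - s) % n, chunks[i][(i + 1 - s) % n]) for i in range(n)]
        for i, c, payload in sent:
            chunks[(i + 1) % n][c] = list(payload)

    # Concatenate the shares back into one vector per node.
    return [[x for share in node for x in share] for node in chunks]
```

With six nodes, both loops run five times, matching the five scatters and five gathers in the text, and every node ends with the same summed vector.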

4.1.3 Comparison between PS-based and ring-based architecture

PS-based architecture decouples the execution into model training and parameter synchronization, which are undertaken by workers and PSs respectively. Such a design strengthens robustness: when failures occur in several workers, the system degrades gracefully Guo et al. (2009). In Ring-based architecture, however, if even a single node fails, the entire ring will collapse and the performance will degrade sharply.

However, since PS-based architecture requires the centralized servers to synchronize the parameters for workers, the communication between servers and workers can become a potential bottleneck. Especially when there are many more workers than servers, the high concurrency caused by workers puts heavy pressure on the servers and further affects the overall performance. By contrast, Ring-based architecture follows a decentralized principle and each node shares the workload (both computation and communication) evenly. Compared with PS-based architecture, there is no significant communication bottleneck even as the number of nodes reaches a large scale.
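A back-of-the-envelope calculation makes the contrast concrete. The sketch below rests on simplifying assumptions that are not stated in the text: in the PS case, the model is sharded evenly across the servers and every worker pulls the full model once per synchronization; in the ring case, each node sends one share (1/n of the model) in each of the n - 1 scatter steps and n - 1 gather steps. The function names are hypothetical.

```python
def ps_server_bytes_sent(model_bytes: float, workers: int, servers: int) -> float:
    """Bytes one parameter server sends per synchronization.

    Assumes the model is split evenly over the servers and that each
    worker pulls every server's shard once.
    """
    return workers * model_bytes / servers


def ring_node_bytes_sent(model_bytes: float, nodes: int) -> float:
    """Bytes one ring node sends per synchronization.

    Each of the (nodes - 1) scatter steps and (nodes - 1) gather steps
    transfers one share of size model_bytes / nodes.
    """
    return 2 * (nodes - 1) * model_bytes / nodes
```

Under these assumptions, the load on a server grows linearly with the number of workers, whereas the load on a ring node stays below twice the model size no matter how many nodes join.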

Besides, the two architectures cannot be ranked unambiguously in terms of scalability. With PS-based architecture, fresh workers can join the current system without much effect on other workers. By contrast, to deploy more nodes into Ring-based architecture, the original ring has to be changed and the logic of scatter and gather on existing nodes has to be rearranged. From this perspective, Ring-based architecture does not scale as well as PS-based architecture. However, compared with Ring-based architecture, PS-based architecture consumes more machines and more links, which can restrict its scalability at large scale. Ring-based architecture requires no centralized nodes and makes full use of each machine. From this perspective, it scales better than PS-based architecture. In short, a comparison of scalability should take into account the specific scenario as well as the key constraints; either architecture can be the better option.

4.2 Parallelism and synchronization

Parallelism is a key factor that helps to accelerate the training process for distributed machine learning platforms. Generally speaking, there are two main parallelism modes to consider, i.e. data parallelism and model parallelism. Confronted with different business models, each platform places its own emphasis on the parallelism modes.

4.2.1 Data parallelism

Data parallelism is a common parallelism mode for distributed systems, in which each node works with the same training model; in other words, each node holds a complete copy of the model parameters. The differences between nodes mainly lie in the training data. Confronted with a large amount of training data, each node holds only one part and conducts the training process with the data stored on its own machine. After a certain period, the nodes communicate with each other to synchronize their model parameters for further training. As mentioned, different architectures (PS-based or Ring-based) use different synchronization mechanisms. Data parallelism is widely applied, and most distributed machine learning platforms support it, such as Tensorflow Abadi et al. (2016), MXNet Chen et al. (2015), Li et al. (2014a), Petuum Xing et al. (2015), etc.

4.2.2 Model parallelism

Some practical businesses may require a huge training model containing billions of parameters, which is too large for a single machine to handle. Model parallelism tries to split the model parameters into several sections and reduce the overlap between the parts. In this way, some steps of the training model can be executed independently with their own parameters, and the overall efficiency can be much improved by parallelizing these steps. GraphLab Low et al. (2014) and \(Tux^{2}\) Xiao et al. (2017) are considered to be two typical works in this direction, both integrated with novel splitting techniques to support model parallelism. They adopt different model decomposition strategies: GraphLab takes the vertex as the minimum granularity and distributes the vertexes across different nodes. The edges reflect the dependency between vertexes, and multiple copies of edges are stored to avoid the loss of dependency information. \(Tux^{2}\), on the other hand, cuts vertexes and replicates them into several copies stored on several nodes. Such a design proves to be effective for handling power-law graphs and better matches PS-based architecture.
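The general idea of splitting parameters so that each part can be computed independently can be shown on a dense layer. Note that this sketch is deliberately simpler than the systems above: it partitions the rows of a weight matrix across hypothetical nodes, and it is not the vertex-based decomposition of GraphLab or the vertex-cut of \(Tux^{2}\). The function names are illustrative.

```python
from typing import List

Matrix = List[List[float]]


def matvec(rows: Matrix, x: List[float]) -> List[float]:
    """Multiply a block of weight rows by the input vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in rows]


def model_parallel_matvec(weight: Matrix, x: List[float], num_nodes: int) -> List[float]:
    """Compute weight @ x with the rows of `weight` split across nodes.

    Each partition holds a disjoint slice of the parameters, so the partial
    products need no communication until the final concatenation.
    """
    step = max(1, -(-len(weight) // num_nodes))  # ceiling division
    partitions = [weight[i:i + step] for i in range(0, len(weight), step)]
    # In a real deployment each call below would run on a different node.
    partials = [matvec(part, x) for part in partitions]
    return [y for part in partials for y in part]
```

The result is identical to the single-machine product for any number of partitions, which is what makes this kind of split attractive when the parameters do not fit on one machine.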

4.2.3 Synchronization and asynchronization

Synchronization versus asynchronization is a concern in either parallelism mode. Synchronization can sometimes cause serious communication costs. Asynchronization, on the other hand, may yield less satisfactory results and require more iterations. Since neither mechanism is perfect, some strategies have been proposed to combine the benefits of the two.

K-bounded delay Li et al. (2014a) can be regarded as a trade-off between synchronization and asynchronization in model updates. It relaxes the synchronization constraint and allows the fastest worker node to run ahead of the slowest one by no more than K rounds. Only when the gap goes beyond K rounds is the fastest node blocked for synchronization. K is a user-defined hyperparameter and varies across models. In particular, when K is set to zero, the K-bounded delay mechanism reduces to the synchronous one; when K is infinite, it reduces to the asynchronous one.
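The blocking rule can be sketched as a small simulation. The interpretation used here is an assumption: a worker may begin a new round only while it has completed at most K more rounds than the slowest worker. The speed model (an integer number of rounds a worker could finish per time unit) and the function names are hypothetical and serve only to illustrate the three regimes.

```python
from typing import List


def can_start_next_round(clock: List[int], worker: int, k: int) -> bool:
    """True if `worker` has completed at most k more rounds than the slowest."""
    return clock[worker] - min(clock) <= k


def simulate(speeds: List[int], k: int, ticks: int) -> List[int]:
    """Return the number of completed rounds per worker after `ticks` time units.

    speeds[i] is how many rounds worker i could finish per tick if it were
    never blocked (an illustrative speed model, not a measured quantity).
    """
    clock = [0] * len(speeds)
    for _tick in range(ticks):
        for worker, speed in enumerate(speeds):
            for _attempt in range(speed):
                # A blocked worker simply idles for the rest of its budget.
                if can_start_next_round(clock, worker, k):
                    clock[worker] += 1
    return clock
```

With `speeds=[3, 1]`, setting `k=0` keeps both workers in lockstep, a very large `k` lets the fast worker run freely, and intermediate values bound how far it can run ahead.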

4.3 Communication optimization

The poor performance of distributed systems has much to do with communication latency. This is even more pronounced with the popularity of GPU-based computing, such as NVIDIA nvidia et al. (2017) and AMD amd (2017). GPUs are efficient for parallel computing thanks to their large number of computing cores, and therefore need an equally efficient communication mechanism. The increasing training scale can incur expensive communication cost and cause severe bottlenecks for the platform. To mitigate the communication bottleneck and achieve satisfactory performance, there are some techniques worth considering.

4.3.1 Efficient communication protocols

As one of the most typical communication protocols, TCP is widely applied in distributed machine learning systems. However, its drawbacks, such as slow start, naive congestion control and high latency, seriously damage system performance. Motivated by this, many researchers have tried to introduce more efficient communication technologies to improve the scalability of distributed systems, including RDMA, GPUDirect RDMA, NVLink, etc.

Remote Direct Memory Access (RDMA) Archer and Blocksome (2012) is another high-performance communication protocol, which aims to access the memory of another machine directly. RDMA minimizes packet-processing overhead and latency with the help of a reliable protocol implemented in hardware, together with zero-copy and kernel-bypass technologies. With those features, RDMA can achieve 100 Gbps throughput and less than 5 \(\mu\)s latency. RDMA has great advantages over TCP and has been applied to distributed machine learning systems such as Tensorflow Jia et al. (2017), Abadi et al. (2016). To further release the potential of RDMA, GPUDirect RDMA gpudirect (2018) enables a direct path for data exchange between the GPU and a third-party peer device (e.g. network interfaces) using standard features of PCIe, without the assistance of the CPU, which would incur extra copying and latency. Related work Yi et al. (2017) tries to introduce GPUDirect RDMA to improve the performance of distributed machine learning systems.

Recently, systems with multiple GPUs and CPUs have become common in AI computing. These GPUs and CPUs communicate with each other via PCIe. However, as GPUs gain more and more computation ability, the traditional PCIe bandwidth is increasingly becoming the bottleneck at the multi-GPU system level, driving the need for a faster and more scalable multiprocessor interconnect. The NVIDIA NVLink nvlink et al. (2018) technology addresses this interconnection issue by providing higher bandwidth, more links, and improved scalability for multi-GPU and multi-GPU/CPU system configurations. A single NVIDIA Tesla V100 GPU supports up to six NVLink connections and a total bandwidth of 300 GB/sec, 10X the bandwidth of PCIe Gen 3. Servers like the new NVIDIA DGX-1 dgx (2017) take advantage of these technologies to gain greater scalability for ultrafast deep learning training.

4.3.2 Data compression and communication filter

Usually, the parameters are stored and transferred in key-value format, which may cause redundancy because values can be small (floats or integers) and the same keys are transferred during each interaction between server and worker. To mitigate this, several strategies are adopted:

Transferring the updated portion of parameters Li et al. (2014a), Hsieh et al. (2017) Since parameters in a model are represented as structured mathematical objects, such as vectors, matrices, or tensors, and typically only a part of the object is updated at each iteration, only the updated part of the object (for example, part of a matrix rather than always the full matrix) needs to be transferred between workers and servers, thus greatly reducing the communication cost.

Transferring the values instead of key-value pairs Due to the range-based push and pull, a range of key-value pairs is communicated at each iteration. When the same range is chosen again, it is likely that only values are changed while the keys are unmodified. If both the sender and receiver have cached these keys, only the values with a signature of the keys need to be transferred between them. Therefore, the network bandwidth is effectively doubled Li et al. (2014b).

Compressing the transferred data Since the transferred values are compressible numbers, such as zeros, small integers, and 32-bit floats with an excessive level of precision, the communication cost can be reduced by using lossless or lossy data compression algorithms. Li et al. (2014b) compress sparse matrices by eliminating most zero values; gRPC Abadi et al. (2016) eliminates redundancy with novel compression algorithms to decrease the transferred data; Wei et al. (2015) use 16-bit floats in place of 32-bit floats to improve bandwidth utilization; and Chilimbi et al. (2014), Zhang et al. (2017) decompose the gradient of a fully connected layer into two vectors to decrease the transferred data.
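Two of the strategies above, sending values with a key signature once both sides have cached the keys, and compressing values, can be sketched together. This is an illustrative toy protocol, not the wire format of any cited system: the message layout, the use of SHA-1 as the signature, the class names `Sender` and `Receiver`, and the half-precision packing via Python's `struct` format `e` are all assumptions made for the example.

```python
import hashlib
import struct
from typing import Dict, List, Sequence, Set, Tuple


def key_signature(keys: Sequence[int]) -> str:
    """Short digest that stands in for a key list both sides have cached."""
    joined = ",".join(str(k) for k in keys)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def drop_zeros(keys: Sequence[int], values: Sequence[float]) -> Tuple[List[int], List[float]]:
    """Lossless sparsification: keep only the non-zero entries."""
    kept = [(k, v) for k, v in zip(keys, values) if v != 0.0]
    return [k for k, _ in kept], [v for _, v in kept]


def pack_fp16(values: Sequence[float]) -> bytes:
    """Lossy compression: 2 bytes per value instead of 4 for 32-bit floats.

    Values outside the half-precision range raise OverflowError.
    """
    return struct.pack(f"<{len(values)}e", *values)


def unpack_fp16(payload: bytes) -> List[float]:
    return list(struct.unpack(f"<{len(payload) // 2}e", payload))


class Sender:
    def __init__(self) -> None:
        self._known: Set[str] = set()  # signatures the receiver has cached

    def encode(self, keys: Sequence[int], values: Sequence[float]) -> dict:
        sig = key_signature(keys)
        message = {"sig": sig, "values": pack_fp16(values)}
        if sig not in self._known:
            # First use of this key range: ship the keys once.
            message["keys"] = list(keys)
            self._known.add(sig)
        return message


class Receiver:
    def __init__(self) -> None:
        self._cache: Dict[str, List[int]] = {}

    def decode(self, message: dict) -> Dict[int, float]:
        if "keys" in message:
            self._cache[message["sig"]] = message["keys"]
        keys = self._cache[message["sig"]]
        return dict(zip(keys, unpack_fp16(message["values"])))
```

After the first message for a given key range, later messages carry only the signature and the packed values, so the per-iteration payload shrinks to the value bytes plus a fixed-size digest.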

Besides, Li et al. (2014b), Hsieh et al. (2017) have observed that many updates in one iteration have a negligible effect on the parameters. To balance computation efficiency and communication, Li et al. (2014b), Chen et al. (2015) adopt a KKT (Karush-Kuhn-Tucker) threshold to filter out the most insignificant updates and transfer only those updates that significantly affect the parameters, while preserving convergence. Gaia Hsieh et al. (2017) also adopts a dynamic filter threshold to select the significant updates, making it efficient to train a model over a WAN.
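The filtering idea can be illustrated with a simple relative-magnitude test. The sketch below is neither the KKT filter nor Gaia's actual mechanism: it assumes a fixed threshold, measures significance as the accumulated update divided by the current parameter magnitude, and holds back insignificant updates locally until their sum becomes significant. The class name `SignificanceFilter` and the small constant guarding against division by zero are hypothetical choices.

```python
from typing import Dict, Optional


class SignificanceFilter:
    """Hold back small updates and release them once they add up."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.pending: Dict[str, float] = {}  # accumulated unsent updates

    def offer(self, key: str, current_value: float, update: float) -> Optional[float]:
        """Return the aggregated update to transmit, or None to stay silent."""
        total = self.pending.get(key, 0.0) + update
        scale = max(abs(current_value), 1e-12)  # avoid dividing by zero
        if abs(total) / scale >= self.threshold:
            # Significant: send everything accumulated so far and reset.
            self.pending.pop(key, None)
            return total
        # Insignificant: keep accumulating locally, send nothing.
        self.pending[key] = total
        return None
```

Because held-back updates are accumulated rather than discarded, no contribution is lost; it is only delayed until it matters enough to justify a transfer.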

4.3.3 Batch computation

Gradient descent (GD) has been widely used in various kinds of distributed machine learning scenarios. It requires considerable communication between GPUs because each GPU must exchange both gradients and parameter values on every update step. To reduce the communication cost, the batch computation idea is applied to the optimization of GD and has become a prevalent method in distributed machine learning. Mini-batch gradient descent (MBGD) Cotter et al. (2011) divides the training data into several parts (each part is called one batch) and uses one batch at a time to update the parameters. Although larger mini-batches reduce the communication cost further, they may slow down the convergence rate in practice Byrd et al. (2012). Thus, the size of the mini-batch should not be set too large. With a moderate batch size, parameters can be updated frequently. Meanwhile, compared with stochastic gradient descent (SGD), which uses one sample each time to update its model, MBGD retains good convergence properties Li et al. (2014c) and is more robust to noise, since the batch can smooth the biases caused by noisy points.

D-PSGD Lian et al. (2017) was recently proposed to utilize the ring-based topology for improving distributed machine learning performance. During each iteration of D-PSGD, each node calculates gradients based on its parameter values from the last iteration and its local training dataset; it then collects the previous parameters from its neighbors (there are two neighbors in the ring-based topology) and averages the three copies of parameters (i.e. the two copies from neighbors and its own local parameters) to replace the local parameters. Finally, it uses the calculated gradients to update the freshly averaged parameters. D-PSGD is proven to have the same computation complexity as, and smoother bandwidth consumption than, traditional topologies such as the PS-based topology. Compared to AllReduce, it gains better performance by reducing the number of communications in high-latency networks. Besides, it follows a parallel workflow and overlaps the time for parameter collection and gradient updates, thus gaining better performance.
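The D-PSGD iteration described above (compute a local gradient, average with the two ring neighbors, then apply the gradient to the averaged value) can be simulated on a toy problem. This sketch makes several assumptions for illustration: each node i has the scalar objective 0.5 * (x - t_i)^2 with a hypothetical target t_i, gradients are exact rather than stochastic, neighbor weights are equal (1/3 each), and the ring has at least three nodes.

```python
from typing import List


def dpsgd_ring(targets: List[float], steps: int = 300, lr: float = 0.1) -> List[float]:
    """Simulate D-PSGD-style updates on a ring with one scalar per node.

    Node i minimizes 0.5 * (x - targets[i]) ** 2, so the global optimum of
    the summed objective is the mean of the targets.
    """
    n = len(targets)
    params = [0.0] * n
    for _ in range(steps):
        # 1. Each node computes its gradient from last iteration's parameter.
        grads = [params[i] - targets[i] for i in range(n)]
        # 2. Each node averages its parameter with its two ring neighbors.
        averaged = [
            (params[(i - 1) % n] + params[i] + params[(i + 1) % n]) / 3.0
            for i in range(n)
        ]
        # 3. The local gradient is applied to the averaged parameter.
        params = [averaged[i] - lr * grads[i] for i in range(n)]
    return params
```

In this toy setting the average of the nodes' parameters converges to the mean of the targets, even though no node ever communicates with more than two neighbors; steps 1 and 2 are independent of each other, which is what allows gradient computation and parameter collection to overlap.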

5 Future direction for network & ML

The advent of the ML revolution brings fresh vitality to computer network research, whereas the improvement of network performance also provides better support for ML computations. The combination of computer network and ML technology is a frontier area, and many open issues remain to be explored. Generally speaking, future research will focus on two main dimensions.

5.1 Network by ML

Many network-related challenges are expected to be solved or mitigated with the integration of ML technologies. As introduced in this paper, QoS optimization and traffic analysis gain much benefit from machine learning techniques. “Network by AI”, which aims to adopt ML technologies to solve network problems, will remain a hot topic. In this direction, some major points deserve attention.

  1. Data Data collection is a key step for most ML techniques. The quality and quantity of data can significantly affect the subsequent modeling and training process. However, network-related data may involve individual privacy and is usually unavailable. For instance, encryption raises the barrier to data access and defeats many analytical methods. Further, even when the data is accessible, the preprocessing of network data requires special consideration. Noisy data and irrelevant features may damage the accuracy of the trained models. The filtering and cleaning of network data are expected to involve much effort and skill. The lack of labeled network data is also a big challenge.

  2. Modeling Confronted with a variety of models and training algorithms, it can be difficult to make the proper choices that match the scenario. In prior works, some classic models and methods, such as basic SVM and linear regression, are employed to solve network problems. To gain better performance, more advanced models are applied to better fit practical cases. No doubt deep learning and reinforcement learning provide more powerful tools for complex network problems. However, the modeling of a training process should be conducted with a full understanding of the practical problem. Indiscriminate use of deep learning and reinforcement learning may not bring much benefit.


5.2 Network for ML

With the popularity of deep learning and reinforcement learning, both of which impose an increasing demand on computation capacity, there are two directions to consider. One is to improve the computational ability of a single machine with advanced hardware (e.g. TPU, DGX-1, HPC, etc.). Distributed machine learning is another competitive solution, in which ML can benefit a lot from high-performance network techniques.

Although there are some novel works on improving the efficiency of ML platforms with high-performance networks, this remains an active research area in both academia and industry, which has attracted much effort but still leaves many open issues:

  1. Network topology There are two main topologies in use today: centralized topology (PS-based) and decentralized topology (ring-based). Both still have drawbacks that hamper scalability and performance. For instance, the bandwidth of the server nodes becomes the bottleneck of the whole system in the centralized topology; on the other hand, the ring-based topology lacks fault tolerance, which makes it hard to deploy in practice. An ideal topology that combines the advantages of the centralized and decentralized designs could benefit the performance of distributed machine learning.

  2. Network protocol The reduction of communication cost is also a key concern in ML platforms. Recent works (such as MPI, RDMA, GPUDirect RDMA, etc.) greatly mitigate the communication bottleneck. However, some drawbacks, like naive flow control, make them inefficient in large-scale networks and reduce throughput in practice. There is thus still room to optimize protocols according to the communication patterns of distributed machine learning to improve its performance.

  3. Fault tolerance Fault tolerance is a long-term concern for network infrastructure, and it will also play a significant role in ML platform construction. Unlike other applications, ML training is less sensitive to imperfections in parameter updates, so how to design a system with relaxed fault tolerance to improve the efficiency of ML systems is an open issue.


The network has always played a fundamental role in computer engineering. The recent development of ML technology brings many novel ideas and methods to network research. It is believed that the combination of network and ML will generate more innovations and create more value in the near future.


  1. In this paper, we use node and machine as synonyms.



This work is supported by the National Natural Science Foundation of China under Grants No. 61772305.


  1. AMD.: Accelerators for High Performance Compute. (2017)
  2. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D.G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., Zheng, X.: Tensorflow: A system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). pp. 265–283. USENIX Association, GA (2016)
  3. Antonakakis, M., Perdisci, R., Nadji, Y., Vasiloglou, N., Abu-Nimeh, S., Lee, W., Dagon, D.: From throw-away traffic to bots: Detecting the rise of dga-based malware. In: Presented as part of the 21st USENIX Security Symposium (USENIX Security 12). pp. 491–506. USENIX, Bellevue, WA (2012)
  4. Archer, C., Blocksome, M.: Remote direct memory access US Patent 8,325,633 (2012)
  5. Ashfaq, A.B., Javed, M., Khayam, S.A., Radha, H.: An information-theoretic combining method for multi-classifier anomaly detection systems. In: 2010 IEEE International Conference on Communications. pp. 1–5 (2010)
  6. Ballani, H., Costa, P., Karagiannis, T., Rowstron, A.: Towards predictable datacenter networks, pp. 242–253. SIGCOMM., ’11, ACM, New York, NY, USA (2011)
  7. Bao, Y., Wu, H., Liu, X.: From prediction to action: a closed-loop approach for data-guided network resource allocation. In: Proceedings of the SigKDD ’16 Conference. pp. 1425–1434 (2016)
  8. Baralis, E.M., Mellia, M., Grimaudo, L.: Self-learning classifier for internet traffic (2013)
  9. Bartos, K., Sofka, M., Franc, V.: Optimized invariant representation of network traffic for detecting unseen malware variants. In: 25th USENIX Security Symposium (USENIX Security 16). pp. 807–822. USENIX Association, Austin, TX (2016)
  10. Bartos, K., Sofka, M., Franc, V.: Optimized invariant representation of network traffic for detecting unseen malware variants. In: USENIX Security Symposium. pp. 807–822 (2016)
  11. Borgolte, K., Kruegel, C., Vigna, G.: Meerkat: Detecting website defacements through image-based object recognition. In: 24th USENIX Security Symposium (USENIX Security 15). pp. 595–610. USENIX Association, Washington, DC (2015)
  12. Botezatu, M.M., Giurgiu, I., Bogojeska, J., Wiesmann, D.: Predicting disk replacement towards reliable data centers. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 39–48 (2016)
  13. Byrd, R.H., Chin, G.M., Nocedal, J., Wu, Y.: Sample size selection in optimization methods for machine learning. Math. Program. 134(1), 127–155 (2012)
  14. Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., Zhang, Z.: Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR abs/1512.01274 (2015)
  15. Chilimbi, T., Suzue, Y., Apacible, J., Kalyanaraman, K.: Project adam: Building an efficient and scalable deep learning training system. In: 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). pp. 571–582. USENIX Association, Broomfield, CO (2014)
  16. Chowdhury, M., Stoica, I.: Efficient coflow scheduling without prior knowledge, pp. 393–406 (2015)
  17. Comar, P.M., Liu, L., Saha, S., Tan, P.N., Nucci, A.: Combining supervised and unsupervised learning for zero-day malware detection. In: 2013 Proceedings IEEE INFOCOM, pp. 2022–2030 (2013)
  18. Cotter, A., Shamir, O., Srebro, N., Sridharan, K.: Better mini-batch algorithms via accelerated gradient methods. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 24, pp. 1647–1655. Curran Associates, Inc. (2011)
  19. Das, A.K., Pathak, P.H., Chuah, C.N., Mohapatra, P.: Contextual localization through network traffic analysis. In: INFOCOM, 2014 Proceedings IEEE, pp. 925–933. IEEE (2014)
  20. Dean, J., Corrado, G.S., Monga, R., Chen, K., Devin, M., Le, Q.V., Mao, M.Z., Ranzato, M., Senior, A., Tucker, P., Yang, K., Ng, A.Y.: Large scale distributed deep networks, pp. 1223–1231. Associates Inc., USA, NIPS’12, Curran (2012)
  21. Dong, M., Li, Q., Zarchy, D., Godfrey, P.B., Schapira, M.: Pcc: re-architecting congestion control for consistent high performance. NSDI 1, 2 (2015)
  22. Fontugne, R., Borgnat, P., Abry, P., Fukuda, K.: Mawilab: combining diverse anomaly detectors for automated anomaly labeling and performance benchmarking. In: International Conference of CoNext, pp. 1–12 (2010)
  23. Foundation, T.A.S.: Hadoop project. (2009)
  24. Franc, V., Sofka, M., Bartos, K.: Learning detector of malicious network traffic from weak labels. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 85–99. Springer (2015)
  25. Franc, V., Sofka, M., Bartos, K.: Learning detector of malicious network traffic from weak labels. In: Proceedings, Part III, of the European Conference on Machine Learning and Knowledge Discovery in Databases, Volume 9286, pp. 85–99. ECML PKDD 2015, Springer-Verlag New York, Inc., New York, NY, USA (2015)
  26. Furno, A., Fiore, M., Stanica, R.: Joint spatial and temporal classification of mobile traffic demands. In: INFOCOM, 36th Annual IEEE International Conference on Computer Communications (2017)
  27. Guo, C., Lu, G., Li, D., Wu, H., Zhang, X., Shi, Y., Tian, C., Zhang, Y., Lu, S.: BCube: a high performance, server-centric network architecture for modular data centers, pp. 63–74 (2009)
  28. Hayes, J., Danezis, G.: k-fingerprinting: a robust scalable website fingerprinting technique. In: 25th USENIX Security Symposium (USENIX Security 16), pp. 1187–1203. USENIX Association, Austin, TX (2016)
  29. He, T., Goeckel, D., Raghavendra, R., Towsley, D.: Endhost-based shortest path routing in dynamic networks: An online learning approach. In: INFOCOM, 2013 Proceedings IEEE, pp. 2202–2210 (2013)
  30. Hsieh, K., Harlap, A., Vijaykumar, N., Konomis, D., Ganger, G.R., Gibbons, P.B., Mutlu, O.: Gaia: Geo-distributed machine learning approaching LAN speeds. In: 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pp. 629–647. USENIX Association, Boston, MA (2017)
  31. Jayaraj, A., Venkatesh, T., Murthy, C.S.R.: Loss classification in optical burst switching networks using machine learning techniques: improving the performance of TCP. IEEE J. Sel. Areas Commun. 26(6), 45–54 (2008)
  32. Jia, C., Liu, J., Jin, X., Lin, H., An, H., Han, W., Wu, Z., Chi, M.: Improving the performance of distributed TensorFlow with RDMA. Int. J. Parallel Program. 3, 1–12 (2017)
  33. Jiang, J., Sekar, V., Milner, H., Shepherd, D., Stoica, I., Zhang, H.: CFA: A practical prediction system for video QoE optimization. In: NSDI, pp. 137–150 (2016)
  34. Li, D., Chen, C., Guan, J., Zhang, Y., Zhu, J., Yu, R.: DCloud: Deadline-aware resource allocation for cloud computing jobs. IEEE Trans. Parallel Distrib. Syst. 27(8), 2248–2260 (2016)
  35. Li, M., Andersen, D.G., Park, J.W., Smola, A.J., Ahmed, A., Josifovski, V., Long, J., Shekita, E.J., Su, B.Y.: Scaling distributed machine learning with the parameter server. In: Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, pp. 583–598. OSDI’14, USENIX Association, Berkeley, CA, USA (2014a)
  36. Li, M., Andersen, D.G., Smola, A.J., Yu, K.: Communication efficient distributed machine learning with the parameter server. In: International Conference on Neural Information Processing Systems, MIT Press, Cambridge, pp. 19–27 (2014b)
  37. Li, M., Zhang, T., Chen, Y., Smola, A.J.: Efficient mini-batch training for stochastic optimization. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 661–670. ACM (2014c)
  38. Li, W., Zhou, F., Meleis, W., Chowdhury, K.: Learning-based and data-driven TCP design for memory-constrained IoT. In: Distributed Computing in Sensor Systems, pp. 199–205. IEEE (2016)
  39. Li, X., Bian, F., Crovella, M., Diot, C., Govindan, R., Iannaccone, G., Lakhina, A.: Detection and identification of network anomalies using sketch subspaces. In: ACM SIGCOMM Conference on Internet Measurement, pp. 147–152 (2006)
  40. Lian, X., Zhang, C., Zhang, H., Hsieh, C.J., Zhang, W., Liu, J.: Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent (2017)
  41. Liu, D., Zhao, Y., Sui, K., Zou, L., Pei, D., Tao, Q., Chen, X., Tan, D.: Focus: Shedding light on the high search response time in the wild. In: IEEE INFOCOM 2016, the IEEE International Conference on Computer Communications, pp. 1–9 (2016)
  42. Liu, D., Zhao, Y., Xu, H., Sun, Y., Pei, D., Luo, J., Jing, X., Feng, M.: Opprentice: towards practical and automatic anomaly detection through machine learning, Tokyo, Japan (2015)
  43. Low, Y., Gonzalez, J.E., Kyrola, A., Bickson, D., Guestrin, C., Hellerstein, J.M.: GraphLab: a new framework for parallel machine learning. CoRR abs/1408.2041 (2014)
  44. Lutu, A., Bagnulo, M., Cid-Sueiro, J., Maennel, O.: Separating wheat from chaff: Winnowing unintended prefixes using machine learning. In: IEEE INFOCOM 2014, IEEE Conference on Computer Communications, pp. 943–951 (2014)
  45. Ma, S., Jiang, J., Li, B., Li, B.: Maximizing container-based network isolation in parallel computing clusters. In: Edition of the IEEE International Conference on Network Protocols, pp. 1–10 (2016)
  46. Mao, H., Alizadeh, M., Menache, I., Kandula, S.: Resource management with deep reinforcement learning, pp. 50–56. HotNets ’16, ACM, New York, NY, USA (2016)
  47. Mao, H., Netravali, R., Alizadeh, M.: Neural adaptive video streaming with Pensieve, pp. 197–210. ACM (2017)
  48. Mirza, M., Sommers, J., Barford, P., Zhu, X.: A machine learning approach to TCP throughput prediction. In: ACM SIGMETRICS Performance Evaluation Review, vol. 35, pp. 97–108. ACM (2007)
  49. NVIDIA: GPU applications: transforming computational research and engineering (2017)
  50. NVIDIA: Developing a Linux kernel module using GPUDirect RDMA (2018)
  51. NVIDIA: NVIDIA DGX-1: the fastest deep learning system (2017)
  52. Nandi, A., Mandal, A., Atreja, S., Dasgupta, G.B., Bhattacharya, S.: Anomaly detection using program control flow graph mining from execution logs, pp. 215–224. KDD ’16, ACM, New York, NY, USA (2016)
  53. Neuvirth, H., Finkelstein, Y., Hilbuch, A., Nahum, S., Alon, D., Yom-Tov, E.: Early detection of fraud storms in the cloud. In: Proceedings, Part III, of the European Conference on Machine Learning and Knowledge Discovery in Databases, Volume 9286, pp. 53–67. ECML PKDD 2015, Springer-Verlag New York, Inc., New York, NY, USA (2015)
  54. Nunes, B.A., Veenstra, K., Ballenthin, W., Lukin, S., Obraczka, K.: A machine learning approach to end-to-end RTT estimation and its application to TCP, pp. 1–6. IEEE (2011)
  55. Baidu Research: Bringing HPC techniques to deep learning (2017)
  56. Santiago del Rio, P.M., Rossi, D., Gringoli, F., Nava, L., Salgarelli, L., Aracil, J.: Wire-speed statistical classification of network traffic on commodity hardware, pp. 65–72. ACM (2012)
  57. Sivaraman, A., Winstein, K., Thaker, P., Balakrishnan, H.: An experimental study of the learnability of congestion control. In: ACM SIGCOMM Computer Communication Review, vol. 44, pp. 479–490. ACM (2014)
  58. Soska, K., Christin, N.: Automatically detecting vulnerable websites before they turn malicious. In: 23rd USENIX Security Symposium (USENIX Security 14), pp. 625–640. USENIX Association, San Diego, CA (2014)
  59. Soule, A., Taft, N.: Combining filtering and statistical methods for anomaly detection. In: Conference on Internet Measurement 2005, Berkeley, California, USA, pp. 31–31 (2005)
  60. Stringhini, G., Kruegel, C., Vigna, G.: Shady paths: Leveraging surfing crowds to detect malicious web pages, pp. 133–144. CCS ’13, ACM, New York, NY, USA (2013)
  61. Sun, Y., Yin, X., Jiang, J., Sekar, V., Lin, F., Wang, N., Liu, T., Sinopoli, B.: CS2P: Improving video bitrate selection and adaptation with data-driven throughput prediction. In: Proceedings of the 2016 ACM SIGCOMM Conference, pp. 272–285. ACM (2016)
  62. Tan, H., Han, Z., Li, X., Lau, F.C.M.: Online job dispatching and scheduling in edge-clouds (2017)
  63. Taylor, V.F., Spolaor, R., Conti, M., Martinovic, I.: AppScanner: Automatic fingerprinting of smartphone apps from encrypted network traffic. In: 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 439–454 (March 2016)
  64. Wang, G., Wang, T., Zheng, H., Zhao, B.Y.: Man vs. machine: Practical adversarial detection of malicious crowdsourcing workers. In: 23rd USENIX Security Symposium (USENIX Security 14), pp. 239–254. USENIX Association, San Diego, CA (2014)
  65. Wang, W., Zhang, Q.: A stochastic game for privacy preserving context sensing on mobile phone. In: IEEE INFOCOM 2014, IEEE Conference on Computer Communications, pp. 2328–2336 (2014)
  66. Wang, Z.: The applications of deep learning on traffic identification. BlackHat USA (2015)
  67. Wei, J., Dai, W., Qiao, A., Ho, Q., Cui, H., Ganger, G.R., Gibbons, P.B., Gibson, G.A., Xing, E.P.: Managed communication and consistency for fast data-parallel iterative analytics, pp. 381–394. ACM (2015)
  68. Winstein, K., Balakrishnan, H.: TCP ex machina: computer-generated congestion control. In: ACM SIGCOMM Computer Communication Review, vol. 43, pp. 123–134. ACM (2013)
  69. Xiao, W., Xue, J., Miao, Y., Li, Z., Chen, C., Wu, M., Li, W., Zhou, L.: Tux2: Distributed graph computation for machine learning. In: 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pp. 669–682. USENIX Association, Boston, MA (2017)
  70. Xie, D., Ding, N., Hu, Y.C., Kompella, R.: The only constant is change: incorporating time-varying network reservations in data centers. ACM SIGCOMM Comput. Commun. Rev. 42(4), 199–210 (2012)
  71. Xing, E.P., Ho, Q., Dai, W., Kim, J.K., Wei, J., Lee, S., Zheng, X., Xie, P., Kumar, A., Yu, Y.: Petuum: a new platform for distributed machine learning on big data. IEEE Trans. Big Data 1(2), 49–67 (2015)
  72. Xu, Q., Liao, Y., Miskovic, S., Mao, Z.M., Baldi, M., Nucci, A., Andrews, T.: Automatic generation of mobile app signatures from traffic observations. In: 2015 IEEE Conference on Computer Communications (INFOCOM), pp. 1481–1489 (2015)
  73. Xu, Y., Yao, J., Jacobsen, H.A., Guan, H.: Cost-efficient negotiation over multiple resources with reinforcement learning. Spain, Barcelona (2016)
  74. Yamada, M., Kimura, A., Naya, F., Sawada, H.: Change-point detection with feature selection in high-dimensional time-series data. J. Catalysis 111(1), 50–58 (2013)
  75. Yi, B., Xia, J., Chen, L., Chen, K.: Towards zero copy dataflows using RDMA. In: Proceedings of the SIGCOMM Posters and Demos, vol. 2017. ACM (2017)
  76. Zadrozny, B.: Learning and evaluating classifiers under sample selection bias, pp. 114. ICML ’04, ACM, New York, NY, USA (2004)
  77. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: Cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 10–10. HotCloud’10, USENIX Association, Berkeley, CA, USA (2010)
  78. Zhang, H., Zheng, Z., Xu, S., Dai, W., Ho, Q., Liang, X., Hu, Z., Wei, J., Xie, P., Xing, E.P.: Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. In: 2017 USENIX Annual Technical Conference (USENIX ATC 17), pp. 181–193. USENIX Association, Santa Clara, CA (2017)
  79. Zhang, R., Qi, W., Wang, J.: Cross-VM covert channel risk assessment for cloud computing: An automated capacity profiler. In: 2014 IEEE 22nd International Conference on Network Protocols, pp. 25–36 (2014)
  80. Zhang, X., Wu, C., Li, Z., Lau, F.C.M.: Proactive VNF provisioning with multi-timescale cloud resources: fusing online learning and online optimization. In: IEEE INFOCOM 2017, IEEE Conference on Computer Communications (INFOCOM), pp. 1–9. IEEE (2017)
  81. Zhang, Z., Zhang, Z., Lee, P.P., Liu, Y., Xie, G.: ProWord: an unsupervised approach to protocol feature word extraction. In: INFOCOM, 2014 Proceedings IEEE, pp. 1393–1401. IEEE (2014)
  82. Zheng, R., Le, T., Han, Z.: Approximate online learning for passive monitoring of multi-channel wireless networks. Proc. IEEE INFOCOM 12(11), 3111–3119 (2013)
  83. Zheng, N., Bai, K., Huang, H., Wang, H.: You are how you touch: User verification on smartphones via tapping behaviors. In: 2014 IEEE 22nd International Conference on Network Protocols, pp. 221–232 (2014)
  84. Zhou, X., Wang, K., Jia, W., Guo, M.: Reinforcement learning-based adaptive resource management of differentiated services in geo-distributed data centers. Spain, Barcelona (2016)
  85. Zhu, J., Li, D., Wu, J., Liu, H., Zhang, Y., Zhang, J.: Towards bandwidth guarantee in multi-tenancy cloud computing networks. In: IEEE International Conference on Network Protocols, pp. 1–10 (2012)

Copyright information

© China Computer Federation (CCF) 2018

Authors and Affiliations

  • Yang Cheng (1)
  • Jinkun Geng (1)
  • Yanshu Wang (1)
  • Junfeng Li (1)
  • Dan Li (1), Email author
  • Jianping Wu (1)

  1. Department of Computer Science and Technology, Tsinghua University, Beijing, China