Research on GRU Neural Network Satellite Traffic Prediction Based on Transfer Learning

In this paper, we propose a Gated Recurrent Unit (GRU) neural network traffic prediction algorithm based on transfer learning. By introducing two gate structures, the reset gate and the update gate, the GRU neural network avoids the vanishing and exploding gradient problems. It can effectively capture long-range-dependent traffic and express the nonlinear, self-similar, long-range-dependent, and other characteristics of satellite network traffic. The paper combines the transfer learning method to overcome the shortage of online traffic data and uses a particle filter online training algorithm to reduce the training time complexity and achieve accurate prediction of satellite network traffic. The simulation results show that the average relative error of the proposed traffic prediction algorithm is 35.80% and 8.13% lower than that of FARIMA and SVR, respectively, and that the particle filter algorithm converges 40% faster than the gradient descent algorithm.


Introduction
Satellite network traffic is affected by the periodic changes of the satellite network topology, the frequent switching of inter-satellite links, and the dynamic change of the inter-satellite link on-off relationship over time. The traffic load is also closely related to the geographical area each satellite covers. As a result, satellite network traffic exhibits complex, nonlinear characteristics [1]. To prevent network congestion and improve the utilization of network resources, reasonable network traffic management is especially important. Traffic prediction makes it possible to grasp the changing characteristics and trends of network traffic in advance, so that an effective traffic management strategy can be formulated to meet users' quality-of-service (QoS) requirements [2]. Therefore, establishing a high-precision traffic prediction model for satellite networks is of great practical significance.

GRU Neural Network
The GRU neural network retains the ability to remember long-term states by using update and reset gate structures, and greatly reduces computational complexity [11]. The GRU neural network structure is shown below (Fig. 1).
The GRU cell can be expressed by the following formulas:

r_t = σ(W_r · [h_{t−1}, x_t])  (1)
z_t = σ(W_z · [h_{t−1}, x_t])  (2)
h̃_t = tanh(W_h̃ · [r_t ⊗ h_{t−1}, x_t])  (3)
h_t = (1 − z_t) ⊗ h_{t−1} + z_t ⊗ h̃_t  (4)
y_t = σ(W_o · h_t)  (5)

The square brackets indicate that two vectors are concatenated, and ⊗ denotes element-wise multiplication. σ is the sigmoid function, whose output lies between 0 and 1 and expresses the degree to which information is updated or forgotten. z_t is the update gate, which determines how much information from the previous moment is saved to the current moment: the larger the value of the update gate, the more information from the previous moment is retained. r_t is the reset gate, which determines how much of the previous moment's state information is combined with the current input. The parameters to be learned in training are W_r, W_z, W_h̃ and W_o; the input of the output layer is y^i_t = W_o h_t, and its output is y^o_t = σ(y^i_t).
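As a concrete illustration, Eqs. (1)-(5) can be sketched as a single NumPy forward step. This is a minimal sketch, not the paper's implementation; the weight shapes and the helper name gru_cell are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, W_r, W_z, W_h, W_o):
    """One GRU step following Eqs. (1)-(5): gates act on [h_{t-1}, x_t]."""
    concat = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ concat)                   # reset gate, Eq. (1)
    z_t = sigmoid(W_z @ concat)                   # update gate, Eq. (2)
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate, Eq. (3)
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde    # interpolated state, Eq. (4)
    y_t = sigmoid(W_o @ h_t)                      # output layer, Eq. (5)
    return h_t, y_t
```

With a hidden size of 4 and an input size of 3, the gate weight matrices have shape (4, 7), since they act on the concatenated vector [h_{t−1}, x_t].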

Traffic Prediction Framework of GRU Neural Network Based on Transfer Learning
To solve the problem of satellite network traffic prediction, this paper proposes a GRU neural network traffic prediction framework based on transfer learning. The framework consists of three parts: the data processing module, the model building module, and the model transfer module. The data processing module is responsible for pre-processing the data, converting the continuous flow data into discrete flow data to meet the input requirements of the model. The model building module is the core of the traffic prediction framework; it applies model tuning methods such as batch normalization and dropout, and a low-complexity particle filter training method is proposed. The model transfer module is another important module: it transfers a model trained on a large amount of offline traffic data to the satellite's online model, avoiding the problem of insufficient online traffic data. Finally, the GRU neural network traffic prediction model is constructed.

Data Processing Module
The data processing module samples the flow data at a fixed time interval t to obtain a discrete traffic sequence. A sliding time window is then used to convert the discrete data into a supervised input format: the sequence is divided with a fixed window of size m, and within each window the data x_m at the last time step is taken as the predicted target output Y of the model, i.e., the label of the supervised data. The supervised data sequence is then divided into a training set and a test set in a certain proportion, finally yielding the data sets for model training and testing.
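The windowing and splitting steps above can be sketched as follows. This is an illustrative helper under the assumption that the first m − 1 values of each window form the input and the last value is the label; the function names are hypothetical, not from the paper.

```python
import numpy as np

def make_supervised(series, m):
    """Slide a window of size m over the series; the first m-1 values of each
    window are the input X and the last value x_m is the supervised label Y."""
    X, Y = [], []
    for i in range(len(series) - m + 1):
        window = series[i:i + m]
        X.append(window[:-1])   # input features
        Y.append(window[-1])    # target output
    return np.array(X), np.array(Y)

def train_test_split(X, Y, train_frac=0.8):
    """Split the supervised pairs into training and test sets in proportion."""
    n = int(len(X) * train_frac)
    return (X[:n], Y[:n]), (X[n:], Y[n:])
```

For example, a series of 10 samples with m = 3 yields 8 windows, each with 2 input values and 1 label.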

Model Building Module
As the core of the traffic prediction framework, the model building module considers the timeliness of satellite network traffic data and the limited satellite computing resources, and therefore designs a single-layer GRU network structure. This not only ensures the prediction quality of the model but also reduces the time needed to optimize its parameters. The overall structure is a three-layer network: the first layer is the input layer, whose number of neurons equals the dimension of the input traffic data; the second layer is a hidden layer, whose number of neurons is determined experimentally; the third layer is the output layer, and since the model outputs a single predicted traffic value, the number of neurons in the output layer is set to 1.
Model training module: model training refers to minimizing the squared loss function by constantly adjusting the weight matrices of the network. Usually, the gradient descent method is used to optimize the model weights. However, the gradient descent optimization process may suffer from over-fitting or fall into a local optimum. Section 4 details how the particle filter algorithm addresses this problem.
Model tuning module: this covers network structure tuning and network parameter tuning. To increase the model's generalization ability, reduce its training time, and lower the risk of overfitting, a Dropout layer is added before the hidden layer [12]. To solve the problem of inconsistent data distributions across batches, batch normalization is performed before the activation function [13].
Standard Dropout is an indirect discard: the output of each neuron is still calculated, and then some neurons are selected with a random probability and their outputs are set to zero. This random discarding is simple to implement, but the discarded neurons must still be computed, which wastes some of the satellite's limited computing resources. This paper therefore designs a pre-drop mode that determines in advance which neuron outputs will be set to zero, so that their computation can be skipped. Although the literature [13] solves the problem of inconsistent batch data distributions, batch normalization loses some characteristics of the original data itself. This paper introduces the learnable parameters γ and β to overcome this problem.
Finally, the overall process of model training and tuning is described as follows. Training the GRU neural network model can be described as optimizing the network parameters Θ so that the difference between the predicted value and the true value of the model is as small as possible. The loss function of the model is the mean square error:

L(Θ) = (1/N) Σ_{i=1}^{N} (Y_i − Ŷ_i)²  (7)

where (X_i, Y_i) are the training data, Θ denotes the weight parameters of the GRU neural network, and Ŷ_i is the predicted output of the model. A Dropout layer is added before the hidden layer:

r^l_j ~ Bernoulli(p^l_j),  x̃^l = r^l ⊗ x^l

where p^l_j is a Bernoulli probability designed according to the characteristics of each batch of satellite flow data, and x̃^l is obtained by randomly discarding elements of the input x^l with probability p^l_j, the output of discarded neurons being set to zero. Batch normalization means normalizing a batch of samples: for a batch x = (x_1, x_2, …, x_d),

x̂_i = (x_i − μ) / σ

where μ is the expectation of the input flow data x and σ is its standard deviation. This standardization reduces the problem of inconsistent data distributions, but directly feeding the standardized x̂_i into the network ignores the feature distribution of the data itself. Therefore, this paper adds two learnable parameters γ_i and β_i to maintain the feature distribution of the original data. After batch normalization, the data input into the activation function is:

y_i = γ_i x̂_i + β_i

where γ_i and β_i are parameters learned by the model for each batch; they can retain part of the data features lost in the normalization operation. As a result, the data distribution input to the activation function is more consistent while preserving the original data characteristics, and the convergence speed of the model is improved.
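The pre-drop and batch normalization steps described above can be sketched in NumPy as follows. This is a minimal illustration of the two ideas under assumed interfaces (function names, drop probability p as a scalar); it is not the paper's code.

```python
import numpy as np

def pre_drop(x, p, rng):
    """'Pre-drop' sketch: sample the Bernoulli mask before any computation,
    so the outputs of dropped neurons never need to be evaluated (here
    simulated by zeroing the corresponding inputs; p is the drop probability)."""
    mask = rng.random(x.shape) >= p
    return x * mask

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a batch to zero mean and unit variance per feature, then
    rescale with the learnable parameters gamma and beta so part of the
    original feature distribution can be recovered."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

With gamma = 1 and beta = 0 the output is simply the standardized batch; during training, gamma and beta would be updated together with the network weights.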

Model Transfer Module
The model transfer module realizes the migration from the source-data model to the destination-data model: the network first learns a feature representation from a large amount of historical traffic data, and the resulting model is then migrated to the online traffic data for further training.
Firstly, the offline flow data is processed by the data processing module to obtain the model's input format, and the model building module is used to obtain the offline traffic prediction model. The online traffic data is processed in the same way, and the model building module is then used to retrain the pre-trained model on it, yielding the online traffic prediction model.
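The two-phase procedure can be illustrated with a deliberately simplified stand-in model: a linear predictor trained by gradient descent rather than a GRU, on synthetic data. Everything here (the model, the data, the hyperparameters) is an assumption for illustration only.

```python
import numpy as np

def train(X, Y, w, lr=0.1, epochs=100):
    """Least-squares gradient training for a linear stand-in model."""
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - Y) / len(Y)
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
w_true = np.array([0.5, -1.0, 2.0])   # synthetic "true" traffic dynamics

# Offline phase: pretrain on the large historical trace.
X_off = rng.normal(size=(1000, 3))
w = train(X_off, X_off @ w_true, np.zeros(3))

# Transfer phase: reuse the pretrained weights as the starting point and
# fine-tune on a much smaller online set with only a few epochs.
X_on = rng.normal(size=(50, 3))
w = train(X_on, X_on @ w_true, w, epochs=10)
```

The key design point is the second call to train: it starts from the pretrained w instead of zeros, so the small online set only needs to nudge the model, not train it from scratch.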

Efficient Online Training Method Based on Particle Filter
The key of the particle filter algorithm is to determine the state transition equation and observation equation of the system [14]. For the GRU traffic prediction model, the discrete time index is the iteration number of the model, and the state of the system is the current estimate of the model's optimal solution. Equations (1)-(5) serve as the state transition equation of the system, and the mean square error loss function (7) serves as the observation equation. The training process of the GRU model based on the particle filter algorithm is as follows. First, a discrete dynamic system model is established:

X_t = f(X_{t−1}) + v_t,  Z_t = h(X_t) + e_t

where X_t is the system state variable, Z_t is the true observation of the system, v_t is the system noise, and e_t is the measurement noise. Particle initialization: if the system state is unknown, every particle is given an equal weight. The initial particle set {x^i_0, 1/N; i = 1, 2, …, N} is generated by sampling from the probability density p(x_0). Initialize the system state: compute the network output y according to the GRU parameters and Eqs. (1)-(5). Set the minimum effective-particle threshold N_thr, let the total number of particles be N and the total number of iterations be tf, set the terminal loss value l, and randomly generate N particles according to the prior probability density p(x_0). Importance sampling: for k = 1, 2, …, N, to avoid particle degeneracy, particles with higher weights are copied and particles with lower weights are removed.
(1) First, randomly draw N particles from the proposal distribution: according to the state transition equation p(x^i_k | x^i_{k−1}), N particles are extracted from the current particle set. (2) Update and normalize the particle weights: according to the observation equation (7), the matching value of every particle x^i_k is calculated, and the optimal particle with its corresponding optimal target value y is selected. The weight of any particle that does not satisfy the constraint is reset to zero; when the constraint is satisfied, the particle weights are updated and normalized according to the current observation y^i_k and Eqs. (15) and (16).
Resampling: calculate the number of effective particles, N_eff = 1 / Σ_i (w^i_k)². Particles with larger weights are copied and those with smaller weights are deleted, forming a new particle set {x^j_k, 1/N; j = 1, 2, …, N}. State estimation: estimate the system state and its variance, let k = k + 1, continue the computation, and judge whether the preset loss termination condition is satisfied.
The setting of N_eff in the particle filter directly determines the prediction accuracy of the model. Iterating the above steps yields the optimal estimate of the state transition equation, from which the final traffic prediction value is obtained.

Evaluation Indicators
To measure the prediction results of the model, three error metrics are used to verify the prediction results, namely mean absolute error (MAE), root mean square error (RMSE), and mean relative error (MRE), defined as follows [15]:

MAE = (1/N) Σ_{i=1}^{N} |Y(i) − Ŷ(i)|
RMSE = sqrt( (1/N) Σ_{i=1}^{N} (Y(i) − Ŷ(i))² )
MRE = (1/N) Σ_{i=1}^{N} |Y(i) − Ŷ(i)| / Y(i)

where Y(i) is the true value, Ŷ(i) is the predicted value, and N is the total number of samples.
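The three metrics translate directly into NumPy; a minimal sketch:

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error."""
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    """Root mean square error."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mre(y, y_hat):
    """Mean relative error (the true values y must be non-zero)."""
    return np.mean(np.abs(y - y_hat) / np.abs(y))
```

For example, with true values (1, 2, 4) and predictions (1, 2, 2), MAE = 2/3, RMSE = sqrt(4/3), and MRE = 1/6.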

Experimental Environment
The GRU network traffic prediction model proposed in this paper is implemented with the Python 2.7 programming language and the TensorFlow 1.3 deep learning framework on the Ubuntu 16.04 operating system. The data source used in the experiments is the "BC-pOct89" trace, from which 400,000 data points are extracted. The 400,000 points are first divided into two parts: an offline data set of 380,000 points and an online real-time data set of 20,000 points. The 380,000 points are fed into the network for training to obtain the offline pre-trained network model. The 20,000 online points are divided into a training set and a test set, with the training set accounting for 4/5 and the test set for 1/5. The online training set is input into the pre-trained GRU neural network model, which is quickly retrained with the particle filter method, and the network parameters are adjusted to obtain the optimal ones. During the experiments, the algorithm was validated with the leave-one-out method to obtain the test results of the model. The proposed algorithm is compared with the traditional FARIMA and SVR traffic prediction algorithms.

Analysis of Experimental Results
To demonstrate the superiority of the proposed transfer learning GRU traffic prediction algorithm, two comparative experiments are set up. The first compares against the FARIMA-based traffic prediction algorithm: FARIMA can only process short-term time series, considers only the statistical continuity of the sequence, and has no nonlinear fitting ability. The second compares against the SVR-based algorithm: SVR performs well in the classification and prediction of traditional data, but it is not well suited to time series data and cannot handle satellite network traffic well. Both FARIMA and SVR can only fit short-term traffic characteristics and cannot capture the long-range-dependent and complex nonlinear characteristics of satellite traffic. The specific experimental results are shown in Fig. 2 (network traffic forecast results). To further compare the prediction effects of the GRU model with the SVR and FARIMA models, Table 1 reports the MAE, MRE, and RMSE of the three models. The table shows that the GRU model achieves better MAE, RMSE, and MRE values than the other two models, reflecting its superiority for predicting satellite network traffic.

Online Training Complexity Analysis
This section compares the complexity of the particle-filter-based online training method with that of the traditional training method, stochastic gradient descent (SGD). Using Eq. (7) as the optimization objective, the recursive formula of the SGD weight update is

Θ_{t+1} = Θ_t − u_t ∇L(Θ_t)

where u_t is the learning rate, a value greater than 0 and less than 1, and ∇L(Θ_t) is the gradient of the loss with respect to the weights. The complexity of the SGD algorithm is O(m⁴ + m²p²), where p is the input space dimension and m is the output space dimension; the complexity thus depends on both the input and output space dimensions.
According to the previous section, the complexity of the PF algorithm is O(N(m² + mp)), where N is the number of particles.
The particle filter online training algorithm has the lowest complexity: although it depends on the number of particles N, N is usually much smaller than the input space dimension p and the output space dimension m, so its complexity is lower than that of stochastic gradient descent.
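For concreteness, the two complexity expressions can be compared under an illustrative configuration. The values of p, m, and N below are assumptions chosen to match the text's claim that N is much smaller than p and m, not values from the paper.

```python
# Illustrative operation counts for the two training methods.
p, m, N = 100, 50, 20            # input dim, output dim, particle count (assumed)
sgd_ops = m**4 + m**2 * p**2     # O(m^4 + m^2 p^2): stochastic gradient descent
pf_ops = N * (m**2 + m * p)      # O(N(m^2 + mp)): particle filter
print(sgd_ops, pf_ops)           # the PF count is orders of magnitude smaller here
```

In this configuration the SGD count is 31,250,000 operations versus 150,000 for the particle filter, which illustrates why a modest N keeps the PF method cheap.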
To verify the low complexity and convergence efficiency of the particle filter algorithm compared with the stochastic gradient descent algorithm, the number of iterations is set to be the same, and we observe how much training data is needed on the same data set before the RMSE stabilizes. The result is shown in Fig. 3.
The experimental results show that, on the mean relative error (MRE) index, the initial relative error of PF-GRU is lower than that of SGD-GRU and its convergence speed is faster: PF-GRU converges after 450 sets of training data, while SGD-GRU requires 750 sets of data before its relative error stabilizes. The particle filter algorithm thus converges faster than the stochastic gradient descent algorithm and requires less training data and, combined with the preceding analysis, has lower complexity. Therefore, the particle filter algorithm can effectively save the computing and storage resources of the satellite.

Conclusion
This paper analyzes the characteristics of satellite network traffic data and proposes a GRU neural network traffic prediction algorithm based on transfer learning. The construction of the GRU neural network model and its tuning methods are described in detail, and the flow of the particle-filter-based online training and update algorithm is given. Moreover, we adopt the transfer learning method to avoid the problem of insufficient online traffic data and reduce the consumption of satellite computing resources. The simulation results show that, compared with the FARIMA and SVR algorithms, the proposed algorithm has superior prediction accuracy, and the particle-filter-based online update method is verified to have low complexity and fast convergence. In short, the proposed traffic prediction algorithm offers higher prediction accuracy, lower computational complexity, and faster convergence, and can effectively save satellite computing and storage resources, making it well suited to predicting satellite traffic.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.