Introduction

In the next 5–10 years, a new generation of information technologies, such as the Internet of Things, cloud computing, big data, and artificial intelligence, will penetrate widely into all fields of the economy and society, and the prosperity of the information economy will become an important indicator of national strength [1,2,3]. To cope with the massive terminal access and data traffic demands that the Internet of Vehicles and the Internet of Things will bring, the International Telecommunication Union (ITU) initiated preliminary research on 5G standardization in 2012. In 2013, the European Union launched the “Mobile and Wireless Communications Enablers for Twenty-Twenty Information Society (METIS)” research project. In the same year, the Ministry of Industry and Information Technology of China, the National Development and Reform Commission, and the Ministry of Science and Technology jointly established the IMT-2020 (5G) Promotion Group. 5G technology has since become a research hot spot. However, as 5G communications are gradually commercialized, some key technologies still need to be solved and improved. For example, when confronting complex IoT architectures [4, 5], how can we increase system capacity and reduce delay, congestion, and energy consumption? Network optimization [6,7,8] is therefore one of the key technologies that urgently needs to be addressed. To meet IoT communication performance requirements, researchers have conducted extensive and in-depth research on network optimization technologies from different perspectives. For example, the literature [9,10,11] adopts reinforcement learning and neural network methods to effectively optimize network capacity and coverage. To improve the quality of service for edge users, low-power nodes such as femto base stations, pico base stations, and relays are introduced alongside macro base stations.
The authors of [12,13,14] propose a heterogeneous network deployment scheme based on the Poisson point process, which minimizes system power consumption through joint optimization techniques. For resource allocation, the authors of [15, 16] introduce a predictive resource optimization algorithm to address the excessive overhead of cooperative games. Similarly, multi-objective optimization algorithms based on artificial bee colonies [17, 18] provide good solutions to a range of problems. Although the above studies each improve performance in some respect, these schemes and algorithms have limitations. Given the complexity of IoT structures, the diversity of applications and requirements, and the massive amount of data generated at all times, it is necessary to explore a new self-organizing optimization strategy adapted to the 5G IoT environment [19, 20]. The literature [21] discusses and puts forward a big data network planning scheme, but does not explain how to implement it. In [22, 23], the authors discuss information security issues in cloud computing architectures. Clearly, the literature on IoT architectures based on big data analysis is insufficient [24]. Therefore, this paper focuses on how to extract valuable information from this massive amount of data: it analyzes the massive network data and explores the relationships between the data variables, so as to make predictions and estimates for the networking mode. We believe this is an interesting and meaningful exploratory study.

In this paper, we mainly do the following:

  1. to analyze the relationship of the variables in the data;

  2. to establish a mathematical model among the variables;

  3. to simulate and analyze the data to verify the correctness of the data model.

This article is organized as follows: the second part discusses background on big data mining analysis and networking technology; the third part elaborates the mathematical model of big data analysis; the fourth part presents the simulation analysis; finally, we summarize our research.

Big data mining and predictive analysis

Faced with massive information data, the main purpose of big data mining analysis is to study the mechanism of network formation by analyzing the correlations between data variables, to build an optimization algorithm, and ultimately to improve efficiency and overall performance [25]. The method based on big data analysis and decision-making is shown in Fig. 1. In Fig. 1, the data comes from three parts: the network environment, the wireless environment, and the user environment. Big data analysis explores the logical connections between data variables through analysis algorithms, while network optimization plans the network and optimizes its operating parameters.

Fig. 1

Network optimization principle of big data analysis and decision-making

The process of big data mining

Big data mining mainly includes the following stages:

  1. The research and understanding stage. The main task is to determine the research object and clarify the research goals and needs. These goals and requirements are then transformed into definitions and formulations of data mining problems, and strategies to achieve them are determined.

  2. The data understanding stage. The task is to conduct exploratory analysis of the collected data to become familiar with and evaluate it, and to select a subset of the data that contains executable patterns.

  3. The data preparation stage. The main task is to select the data variables to be analyzed according to the demand goals, then transform the chosen variables and clean the original data.

  4. The modeling stage. The task is to use appropriate techniques to build and optimize the model.

  5. The evaluation stage. The task is to evaluate the effect of the established data model: to confirm whether it meets the objectives and requirements of the first stage, and whether the research problems or their important components are clearly explained.

  6. The deployment stage. The task is to carry out a simple application deployment and establish corresponding data reports for the evaluated model.

In the networking process based on data mining, these six stages are not completely separate but are interrelated and interdependent.

Networking technology for big data mining and analysis

Network description

Network description refers to methods for discovering networking patterns and trends hidden in the data. For example, data analysis may find that a particular networking method yields the lowest system energy consumption, the largest throughput, the best route, or the largest cellular coverage area. The data mining model should be transparent [26, 27]; in other words, network descriptions should admit intuitive explanations or exhibit clear patterns, so that they can be presented in an intuitive, easy-to-understand form such as decision trees or graphical methods.

Classification of network variables

Classification is a commonly used technique. The target variable of classification is categorical rather than numeric. Variable classification assigns a load category to a new individual (target) [3]. If the new individual is not in the existing data set, then the data mining task must classify it based on its other characteristics (such as energy consumption and bandwidth). For example, the network load variable can be divided into three categories: high load, low load, and normal load. The data mining model examines a large number of network data records, each of which contains the target variable and a set of input variables or predictors. For the network environment in this article, consisting of 20 base stations and 500 users, the data set is shown in Table 1. Based on this data set, we classify the load of a network to provide a basis for subsequent networking.

Table 1 Network load classification

The algorithm works as follows: first, the predictor variables and the target variable (load) in the data set are learned, so that we know which load category each combination of variable values is associated with. Next, the algorithm queries the new record set and assigns a category to each new record. For example, pico base stations may be classified into a relatively high-load category.
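As a minimal illustrative sketch of this classification step, the following pure-Python nearest-centroid rule assigns a load category to a new record from two predictors. The feature values (energy consumption, bandwidth) and labels are invented for illustration and are not the paper's Table 1 data.

```python
# Hypothetical sketch: classify a base station's load category from its
# predictor variables using a nearest-centroid rule. The training values
# (energy_consumption, bandwidth) and labels below are illustrative only.
import math

# Training records: (energy_consumption, bandwidth) -> load category
training = {
    "high":   [(9.1, 80.0), (8.7, 75.0), (9.5, 90.0)],
    "normal": [(5.0, 45.0), (5.5, 50.0), (4.8, 40.0)],
    "low":    [(1.2, 10.0), (1.5, 12.0), (0.9, 8.0)],
}

def centroid(points):
    """Mean of a list of 2-D feature vectors."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

centroids = {label: centroid(pts) for label, pts in training.items()}

def classify(record):
    """Assign the load category whose centroid is nearest to the record."""
    return min(centroids, key=lambda lbl: math.dist(record, centroids[lbl]))

# A new record (e.g. a pico base station) is assigned to the closest class.
print(classify((8.9, 82.0)))   # -> high
```

A real deployment would use many more predictors and a learned model (e.g. a decision tree), but the query-and-assign pattern is the same.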

The classification tasks are as follows:

  1. According to specific needs, put the new target network into the data queue;

  2. Evaluate the impact of the newly added target network on system performance;

  3. Evaluate the change in system performance (benefit or risk) caused by changing a target variable.

Network assessment

Network evaluation approximates the value of the target variable using a set of numbers or variable values predicted by classification. For example, we may only be interested in system energy consumption and bandwidth. Evaluation is based on complete data records, which contain the values of the target variable and the predictors. Using new observation records, we apply the fitted data model to estimate the relationship between the target variable and the predictor variables, and thus properly evaluate how system energy consumption and bandwidth relate to the predictors. The main contents of the network assessment include the following:

  1. Evaluate the system energy consumption and bandwidth utilization after networking;

  2. Evaluate the impact of user demand changes on system energy consumption and bandwidth utilization;

  3. Assess the proportion of change in system energy consumption and utilization brought about by changes in each target variable;

  4. Based on the energy consumption of the system, evaluate the system throughput.

Networking model of data mining analysis

The 5G IoT is complex and its environment is harsh, so network optimization will be a very important research field [5, 6, 9, 10, 28]. It plays a key role in network energy consumption, system capacity, delay, and so on. Next, according to the above technical requirements, we establish a network model based on data mining analysis. The network model is shown in Fig. 2.

Fig. 2

Data mining analysis network model

In Fig. 2, there is a macro base station, called the 5G base station. Several micro base stations and access points are randomly deployed within its coverage area. The macro base station, micro base stations, and access points provide voice, media, text, and video services for users. In this service area, the data center unit collects data from the various base stations, such as bandwidth, throughput, delay, and power consumption. The optimization unit comprises a series of intelligent algorithms: data mining algorithms uncover the logical relationships between variables, and network optimization algorithms enhance system bandwidth utilization and throughput while reducing delay and power consumption. The resource management unit allocates the resources of the various base stations. The radio access unit consists mainly of transceiver and baseband processing equipment, and the control unit is responsible for coordinating the various control commands.

Data analysis

A data set usually contains multiple variables, and these variables are not necessarily independent or unrelated. Data mining analysis predicts the correlations between these variables. In this article, principal component analysis (PCA) [26] is used: a statistical method that finds associations among a large number of variables. It forms linear combinations of the variables to select a small set of important derived variables that describe the correlation structure; these linear combinations are called components. Assuming the network data set contains m variables, it can be represented by k (usually k < m) linear combinations of the variables. This means that the k components reflect essentially the same information as the m original variables.

Data processing

The initial variables \(X_{1} ,X_{2} , \ldots ,X_{m}\) span an m-dimensional space, and the principal components represent a new coordinate system. When performing principal component analysis, the original data is first normalized. The normalized variable is represented by \(Z_i\) (n × 1 dimension), with mean 0 and standard deviation 1, so that \(Z = (V^{1/2} )^{ - 1} (X - \mu )\), where \(\mu\) denotes the mean vector of X and \(V^{1/2} = \left[ {\begin{array}{*{20}c} {\sigma_{11} } & 0 & \cdots & 0 \\ 0 & {\sigma_{22} } & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & {\sigma_{mm} } \\ \end{array} } \right]\) is the m × m standard deviation matrix. The covariance of variables \(X_{i}\) and \(X_{j}\) can be expressed as:

$$ \sigma_{ij}^{2} = \frac{{\sum\nolimits_{k = 1}^{n} {(x_{ki} - \mu_{i} )(x_{kj} - \mu_{j} )} }}{n} $$
(1)
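The normalization \(Z = (V^{1/2})^{-1}(X - \mu)\) and the covariance of formula (1) can be sketched in a few lines of numpy. The small data matrix below is illustrative, not the paper's data; note that formula (1) uses divisor n (population covariance), which the code mirrors.

```python
# Sketch of the normalization Z = (V^{1/2})^{-1}(X - mu) and the covariance
# of formula (1), on a small illustrative data matrix (n records x m variables).
import numpy as np

X = np.array([[2.0, 10.0],
              [4.0, 14.0],
              [6.0, 12.0]])          # n = 3 records, m = 2 variables (made up)
n, m = X.shape

mu = X.mean(axis=0)                   # mean of each variable
sigma = X.std(axis=0)                 # population std (divisor n), as in (1)
Z = (X - mu) / sigma                  # standardized variables: mean 0, std 1

# Covariance matrix with entries sigma_ij^2 as in formula (1) (divisor n)
cov = (X - mu).T @ (X - mu) / n

print(Z.mean(axis=0))                 # approximately [0, 0]
print(Z.std(axis=0))                  # approximately [1, 1]
```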

Principal components

The standardized matrix of the normalized variables is \(Z = \left( {V^{1/2} } \right)^{ - 1} \left( {X - \mu } \right)\), and its covariance matrix is \({\text{Cov}}(Z) = (V^{1/2} )^{ - 1} \Sigma (V^{1/2} )^{ - 1}\), where \(\Sigma\) is the symmetric covariance matrix of X. The ith principal component of the standardized matrix \(Z = [Z_{1} ,Z_{2} , \ldots ,Z_{m} ]\) is given by \(Y_{i} = e_{i}^{\prime} Z\), where \(e_{i}\) is the ith eigenvector. The principal components \(Y_{1} ,Y_{2} , \ldots ,Y_{k}\) are linear combinations of the standardized variables in Z.
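The components \(Y_i = e_i' Z\) can be obtained by eigendecomposition of the correlation matrix of the standardized data. The following sketch uses randomly generated illustrative data (not the paper's data set); the variance of each extracted component equals its eigenvalue, which is the property PCA exploits.

```python
# Sketch of extracting principal components Y_i = e_i' Z via eigendecomposition
# of the correlation matrix Cov(Z). Data here is illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                     # 200 records, 4 variables
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)    # induce correlation

Z = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize (divisor n)
R = Z.T @ Z / Z.shape[0]                          # correlation matrix Cov(Z)

eigvals, eigvecs = np.linalg.eigh(R)              # eigenpairs (ascending)
order = np.argsort(eigvals)[::-1]                 # reorder descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Y = Z @ eigvecs                                   # components: Y_i = e_i' Z
# The variance of each component equals the corresponding eigenvalue
print(np.round(eigvals, 3))
```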

Multiple regression based on principal components

Because the data contains many variables, a multiple regression model provides a good method for estimating and predicting the relationship between the target variable and the predictors. The multiple regression equation with m predictors is expressed as follows:

$$ y = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \cdots + \beta_{m} x_{m} + \varepsilon , $$
(2)

where \(\beta_{0} ,\beta_{1} ,\beta_{2} , \ldots ,\beta_{m}\) are the unknown regression coefficients of the parametric model and \(\varepsilon\) is the error term. The least squares method effectively estimates these unknown parameters by minimizing the overall sum of squared errors:

$$ {\text{SSE}} = \sum\limits_{i = 1}^{n} {\varepsilon_{i}^{2} } = \sum\limits_{i = 1}^{n} {\left( {y_{i} - \beta_{0} - \beta_{1} x_{1i} - \beta_{2} x_{2i} - \cdots - \beta_{m} x_{mi} } \right)^{2} } $$
(3)

In formula (3), the partial derivatives are:

$$ \begin{gathered} \frac{{\partial {\text{SSE}}}}{{\partial \beta_{0} }} = - 2\sum\limits_{i = 1}^{n} {(y_{i} - \beta_{0} - \beta_{1} x_{1i} - \cdots - \beta_{m} x_{mi} )} \hfill \\ \frac{{\partial {\text{SSE}}}}{{\partial \beta_{1} }} = - 2\sum\limits_{i = 1}^{n} {x_{1i} (y_{i} - \beta_{0} - \beta_{1} x_{1i} - \cdots - \beta_{m} x_{mi} )} \hfill \\ \quad \vdots \hfill \\ \frac{{\partial {\text{SSE}}}}{{\partial \beta_{m} }} = - 2\sum\limits_{i = 1}^{n} {x_{mi} (y_{i} - \beta_{0} - \beta_{1} x_{1i} - \cdots - \beta_{m} x_{mi} )} \hfill \\ \end{gathered} $$
(4)

Setting formula (4) equal to 0 gives:

$$ \begin{gathered} n\beta_{0} + \beta_{1} \sum\limits_{i = 1}^{n} {x_{1i} } + \cdots + \beta_{m} \sum\limits_{i = 1}^{n} {x_{mi} } = \sum\limits_{i = 1}^{n} {y_{i} } \hfill \\ \beta_{0} \sum\limits_{i = 1}^{n} {x_{1i} } + \beta_{1} \sum\limits_{i = 1}^{n} {\left( {x_{1i} } \right)^{2} } + \cdots + \beta_{m} \sum\limits_{i = 1}^{n} {x_{mi} x_{1i} } = \sum\limits_{i = 1}^{n} {y_{i} x_{1i} } \hfill \\ \quad \vdots \hfill \\ \beta_{0} \sum\limits_{i = 1}^{n} {x_{mi} } + \beta_{1} \sum\limits_{i = 1}^{n} {x_{1i} x_{mi} } + \cdots + \beta_{m} \sum\limits_{i = 1}^{n} {\left( {x_{mi} } \right)^{2} } = \sum\limits_{i = 1}^{n} {y_{i} x_{mi} } \hfill \\ \end{gathered} $$
(5)

The matrix of formula (5) is

$$ {\text{X}}B = Y $$
(6)

where \(X = \left[ {\begin{array}{*{20}c} n & {\sum\limits_{i = 1}^{n} {x_{1i} } } & \cdots & {\sum\limits_{i = 1}^{n} {x_{mi} } } \\ {\sum\limits_{i = 1}^{n} {x_{1i} } } & {\sum\limits_{i = 1}^{n} {\left( {x_{1i} } \right)^{2} } } & \cdots & {\sum\limits_{i = 1}^{n} {x_{mi} x_{1i} } } \\ \vdots & \vdots & \ddots & \vdots \\ {\sum\limits_{i = 1}^{n} {x_{mi} } } & {\sum\limits_{i = 1}^{n} {x_{mi} x_{1i} } } & \cdots & {\sum\limits_{i = 1}^{n} {\left( {x_{mi} } \right)^{2} } } \\ \end{array} } \right]\), \(B = \left[ \begin{gathered} \beta_{0} \hfill \\ \beta_{1} \hfill \\ \vdots \hfill \\ \beta_{m} \hfill \\ \end{gathered} \right]\), \(Y = \left[ \begin{gathered} \sum\limits_{i = 1}^{n} {y_{i} } \hfill \\ \sum\limits_{i = 1}^{n} {y_{i} x_{1i} } \hfill \\ \;\, \vdots \hfill \\ \sum\limits_{i = 1}^{n} {y_{i} x_{mi} } \hfill \\ \end{gathered} \right].\)

According to the properties of determinant equations (Cramer's rule) [29], the solution of Eq. (6) is:

$$ \beta_{0} = \frac{{\left\| {X_{0} } \right\|}}{{\left\| X \right\|}},\;\beta_{1} = \frac{{\left\| {X_{1} } \right\|}}{{\left\| X \right\|}},\; \ldots ,\;\beta_{m} = \frac{{\left\| {X_{m} } \right\|}}{{\left\| X \right\|}}, $$

where the symbol \(\left\| \bullet \right\|\) denotes the determinant, and \(X_{j}\) is the matrix X with its (j + 1)th column replaced by Y.

We impose a simple constraint: the determinant \(\left\| X \right\|\) is nonzero, i.e., X is of full rank. Then X is an invertible matrix, and the parameters of the multivariate regression equation can be computed directly as B = X−1Y, where X−1 is the inverse matrix of X.
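The normal-equation system (6) can be sketched numerically as follows. The data and the "true" coefficients are illustrative; with a noiseless response the solve recovers them exactly. In practice `np.linalg.solve` is preferred over forming the explicit inverse X−1.

```python
# Sketch of solving the normal equations (6), XB = Y, for the regression
# coefficients. Predictors and true coefficients below are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 3
x = rng.normal(size=(n, m))                       # n records, m predictors
beta_true = np.array([2.0, 0.5, -1.0, 3.0])       # [b0, b1, b2, b3]
y = beta_true[0] + x @ beta_true[1:]              # noiseless response

# Build the (m+1) x (m+1) system of Eq. (5)/(6)
A = np.column_stack([np.ones(n), x])              # design matrix with intercept
Xmat = A.T @ A                                    # left-hand matrix X of (6)
Yvec = A.T @ y                                    # right-hand vector Y of (6)

# Full rank -> invertible; solve() is numerically safer than inv(Xmat) @ Yvec
B = np.linalg.solve(Xmat, Yvec)
print(np.round(B, 6))                             # recovers beta_true
```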

Simulation analysis

In this section, we perform a data simulation analysis on the configured network environment. The network parameter settings are shown in Table 2:

Table 2 Network parameter settings

Next, we deploy base stations as shown in Fig. 3. In Fig. 3, the symbol "O" represents the coverage of each base station, where the number represents the label of the base station, and the symbol "•" represents randomly distributed users.

Fig. 3

Distribution of base stations and users

We conduct a systematic simulation to obtain the data generated during operation and take a snapshot of the data at a given moment as a data set. The data set includes power consumption, throughput, coverage area, number of served users, user density, and percentage of unserved users. The data is shown in Table 3:

Table 3 Data of each base station at any time

In Table 3, the rows represent base stations, and the columns represent the variables: power consumption, throughput, coverage area, number of users served, user density, and percentage of unserved users. Next, we normalize the data in Table 3. In data processing, the correlation matrix is mainly used to observe the correlation structure between predictor variables. The correlation matrix of the variables is shown in Table 4.

Table 4 Variable correlation matrix
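The correlation matrix of Table 4 is obtained by standardizing each column and forming \(Z'Z/n\). The sketch below uses randomly generated placeholder values (20 base stations × 6 variables), not the actual Table 3 data, and checks the result against numpy's built-in correlation.

```python
# Sketch of computing the variable correlation matrix (as in Table 4) from a
# data matrix whose rows are base stations and whose columns are the six
# variables. The values here are illustrative, not the paper's Table 3 data.
import numpy as np

rng = np.random.default_rng(3)
data = rng.uniform(1.0, 10.0, size=(20, 6))   # 20 base stations x 6 variables

# Normalize each column, then form the correlation matrix R = Z'Z / n
Z = (data - data.mean(axis=0)) / data.std(axis=0)
R = Z.T @ Z / data.shape[0]

# Equivalent to numpy's built-in correlation of the columns
print(np.allclose(R, np.corrcoef(data.T)))    # -> True
```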

Next, we use principal component analysis to obtain the principal component matrix of the data set, shown in Table 5. Each column in the table represents one component, and the elements in a column are the component's weights, that is, correlation coefficients between −1 and 1.

Table 5 Principal component matrix

Table 5 shows that the first principal component is the best single summary of the correlations among the predictors, because the linear combination \(Y_{1} = e_{11} Z_{1} + e_{21} Z_{2} + \cdots + e_{m1} Z_{m}\) has greater variability than any other combination in Z; that is, \({\text{Var}}(Y_{1} ) = e_{1}^{\prime} \rho e_{1}\) is maximal, where ρ is the correlation matrix. The correlation matrix shows that base station throughput, served users, and user density are highly correlated with the first principal component.

The proportion of the overall variation of Z explained by a principal component is the ratio of its eigenvalue to the number of predictors. For example, the first eigenvalue is 4.3624 and there are 6 predictors in the component matrix, so the first principal component explains 4.3624/6 = 72.71% of the variation. That is, this one component accounts for about 72% of the variability of the six predictor variables in the data set. Table 6 shows each component and the proportion of variation it explains.

The second principal component Y2 is the second best linear combination of variables and is orthogonal to the first principal component. The remaining principal components are defined in sequence.

Table 6 shows that among the 6 principal components, the first four already explain 100% of the variation, and some eigenvalues are extremely low, accounting for almost none of the variability in Z. Clearly, it is not necessary to retain all the principal components, so deciding how many to retain or extract is a key step in the analysis. In this paper, the principal components are selected by means of an eigenvalue scree plot [21, 26, 30]: the curve of eigenvalue against component number indicates how many principal components should be retained. Figure 4 shows the scree plot. From it, at most 3 principal components need to be extracted, because the first 3 already explain 99% of the variance.

Table 6 Principal component eigenvalues and variation ratio
Fig. 4

Scree plot of eigenvalues
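The component-selection rule can be sketched directly from the explained-variance ratios. Only the first eigenvalue (4.3624) comes from the text; the remaining eigenvalues below are illustrative stand-ins chosen so the cumulative proportion crosses 99% at three components, matching the selection described above.

```python
# Sketch of choosing how many principal components to keep from the
# explained-variance ratios (eigenvalue / number of predictors). Only the
# first eigenvalue is from the text; the tail values are illustrative.
eigenvalues = [4.3624, 1.21, 0.39, 0.03, 0.005, 0.0026]
m = len(eigenvalues)

ratios = [ev / m for ev in eigenvalues]      # proportion each component explains
cumulative = []
total = 0.0
for r in ratios:
    total += r
    cumulative.append(total)

# Keep the smallest k whose cumulative explained variance reaches 99%
k = next(i + 1 for i, c in enumerate(cumulative) if c >= 0.99)
print(round(ratios[0], 4), k)                # -> 0.7271 3
```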

To predict variables of interest, such as system energy consumption, we use a multiple regression model based on principal components [26]. Performing multiple regression directly on the raw predictors is unstable because of multi-collinearity in the data set [11]; this is why principal component analysis is used to analyze the predictor variables first. According to the above analysis, the multiple regression model is:

$$ Y = b_{0} + b_{1} X_{1} + b_{2} X_{2} + b_{3} X_{3} $$
(7)

In formula (7), \(b_{0} ,b_{1} ,b_{2} ,b_{3}\) are the undetermined coefficients of the regression equation and \(X_{1} ,X_{2} ,X_{3}\) are the principal components. According to the above analysis, we take three principal components. Using Eq. (6), the final regression equation is:

$$ Y = - 0.0010 + 0.0174X_{1} + 0.0043X_{2} - 0.0000X_{3} $$
(8)

Next, according to regression Eq. (8), we predict the energy consumption of each base station in the data set; the prediction results and residuals are shown in Figs. 5 and 6. Figure 5 shows that the principal component multiple regression model predicts the variable we care about well, and the residuals are small.

Fig. 5

Regression and comparison

Fig. 6

Source data and regression value residuals
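The whole principal-component-regression pipeline above can be sketched end to end. The collinear predictors, the target, and k = 3 are all illustrative; the point is that regressing on the first k components sidesteps the multi-collinearity that would destabilize a regression on the raw predictors.

```python
# Hedged sketch of principal-component regression: standardize, extract the
# first k components, then fit the regression of Eq. (7) on them.
# All data below is synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(2)
n, m, k = 100, 6, 3
base = rng.normal(size=(n, 3))
X = np.column_stack([base, base + 0.01 * rng.normal(size=(n, 3))])  # collinear
y = X[:, 0] - 0.5 * X[:, 1] + 0.05 * rng.normal(size=n)             # target

Z = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize
eigvals, eigvecs = np.linalg.eigh(Z.T @ Z / n)    # eigenpairs (ascending)
order = np.argsort(eigvals)[::-1]
PC = Z @ eigvecs[:, order[:k]]                    # first k components

A = np.column_stack([np.ones(n), PC])             # regression as in Eq. (7)
b, *_ = np.linalg.lstsq(A, y, rcond=None)         # coefficients b0..b3
residuals = y - A @ b
print(np.round(b, 4), float(np.abs(residuals).max()))
```

Because the components are orthogonal, the coefficient estimates are stable even though the raw predictors are nearly collinear.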

Conclusion

The 5G IoT architecture is complex. To improve network performance, we adopt a network architecture based on big data analysis. In this paper, we use the data center to collect network operation data and apply data mining analysis to uncover the inherent connections between variables in the massive data, as a basis for optimizing the network architecture. We use principal component analysis to obtain the principal component matrix of the variables and establish a multiple regression model based on the principal components. The simulation results show that this method predicts the data effectively with small residual error, and the algorithm is simple and fast. Therefore, it can play an important role in the optimization of the 5G network architecture and provides an important basis for further research.