Introduction

According to the 12th annual Cisco Visual Networking Index Complete Forecast,Footnote 1 the number of internet users will increase from 3.3 billion to 4.6 billion by 2021, which equates to 61% of the global population using the internet.

Easy access has led to ever-increasing use of the internet. Today, more and more information is uploaded to websites and downloaded by users. Therefore, web hosting customers demand more from web hosting providers, and the inevitable result is that the controllability and manageability of web hosting resources must be enhanced.

Web hosting companies provide a service that allows individuals and organizations to create their websites. They provide space on a server so that the customer can upload the files of their website, such as documents, videos, music, and images. The amount of this space is called storage usage. Web hosting providers have thousands of servers and millions of customers, and predicting customer storage usage plays an important role in their quality of service, customer satisfaction, energy saving, and maintenance management. Another great advantage of this prediction methodology is that the web hosting company can predict the achievable future cash flows of customers, which helps it determine and analyze the customer portfolio by calculating customer value. Customer value is defined as the sum of a customer's discounted future cash flows.

A first step toward predicting the storage usage of customers is to apply Poisson distribution theory, which approximates a Markov process. For instance, a forecasting model based on the autoregressive method was proposed by Barba and Rodríguez (2015). However, given the autocorrelation properties of customer storage usage, applying the Poisson process is not a reliable method (Iliev and Bedzhev 2015; Morikawa and Tsuneda 2014). Linear models such as the Markov-modulated Poisson process and the moving average model are applied efficiently for short-term forecasting (Mai et al. 2014; Borchers and Langrock 2015), but as the forecasting step increases, the forecasting error gradually grows (Hou et al. 2018). In fact, because of the strongly non-linear behavior of web hosting customers, the customer storage usage model is a non-linear system, and applying classical prediction methods is not a reliable strategy for our purpose.

One of the most popular tools for predicting non-linear and time-varying behavior is the artificial neural network. The artificial neural network has processing functions such as learning, memorizing, and computing, which make it a strong tool for solving non-linear problems (Hou et al. 2018). Its other advantages are non-linear signal memorization, distributed processing, and adaptive abilities (Moretti et al. 2015).

The most important difference between previous research and this research is the complexity of customer behavior. On the one hand, in this problem the customers show three different types of behavior:

  1. Customers have a positive storage usage (they upload data to the server).

  2. Customers have a negative storage usage (they delete previously uploaded data from the server).

  3. Customers do nothing.

On the other hand, the variation of storage usage between customers is very large. The minimum storage usage is 4 KB and the maximum is 12.7 GB.

The purpose of this article is to apply a fuzzy artificial neural network (FNN) to identify a model that is optimal for predicting the absolute customer storage usage for each server. Therefore, different training algorithms are compared and improved. Finally, it will be shown that the improved algorithm is significantly better than the other algorithms.

This article is structured as follows: The FNN architecture is described in "The fuzzy neural network architecture" section. The "Learning algorithm" section discusses learning algorithms. The results of the experiments are presented in the "Results of the experiment" section. The "Significance test" section contains the applied significance test, and finally the conclusion is presented in the "Conclusion" section.

The fuzzy neural network architecture

The radial basis function (RBF) neural network is an eminently appropriate method for modeling non-linear systems and has been applied in many research studies. In the following paragraphs, a brief history of the application of RBF networks is presented.

In 2012, Fei and Ding proposed a new adaptive RBF neural network to control dynamic systems. The adaptive RBF neural network was applied to learn the upper bound of model uncertainties and external disturbances, and the experimental results illustrated the stability of the closed-loop system (Fei and Ding 2012).

Huang and Ben-Hsiang (2014) presented a motion detection approach based on RBF artificial neural networks to segment moving objects in dynamic scenes. The final evaluation results show that the recommended method achieves complete and accurate detection in both static and dynamic scenes.

Han et al. (2016) applied an RBF neural network to non-linear system modeling. In contrast to the previously mentioned methods, a prediction approach based on the RBF artificial neural network is applied to predict the wastewater treatment process.

Figure 1 illustrates the architecture of the FNN proposed by Han et al. (2016). The model consists of four layers. Each layer has a number of neurons (nodes) as well as input and output values. The input value for each neuron is defined as \(I_{b}^{L}\) and the output value as \(O_{b}^{L}\), where L denotes the layer and b the index of the neuron.

Fig. 1
figure 1

The structure of fuzzy neural network (Han et al. 2016)

Fig. 2
figure 2

The weight between third and fourth layer

First layer:

  • The input value in the first layer is the amount of storage usage for each server in one week. The applied FNN structure predicts the absolute storage usage for each server according to the input values. For instance, the input values can be the weekly storage usage during the last 70 days (10 input values), the last 7 months (30 input values), or the last year (52 input values). A collection of servers is monitored over a specific time period to predict future customer storage usage.

    Because of the huge variation in storage usage across servers, all input values are normalized to the range between 0 and 1 before the model is applied and de-normalized afterwards (a minimal normalization sketch is given at the end of this layer's description). The input and output values in the first layer are assigned as follows:

    $$I_{h}^{{{\text{Input}}L}} = x_{h}$$
    (1)
    $$O_{h}^{{{\text{Input}}L}} = I_{h}^{{{\text{Input}}L}}$$
    (2)

    In these equations, h = 1, 2,…, k, where k is the number of nodes (neurons) in the first layer (input layer) and x = [\(x_{1} ,x_{2} , \ldots ,x_{k}\)].
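
As a minimal illustration of the normalization and de-normalization step described above, the following Python sketch uses simple min-max scaling; the exact scaling scheme is not specified in the source, so this choice and the sample values are assumptions.

```python
import numpy as np

def normalize(usage, lo=None, hi=None):
    """Scale weekly storage-usage values into [0, 1] (min-max normalization)."""
    lo = usage.min() if lo is None else lo
    hi = usage.max() if hi is None else hi
    return (usage - lo) / (hi - lo), lo, hi

def denormalize(scaled, lo, hi):
    """Map model outputs back to the original storage-usage scale."""
    return scaled * (hi - lo) + lo

# Example: 10 weekly storage-usage values (KB) for one server (toy numbers).
weekly_usage = np.array([4.0, 120.5, 98.3, 0.0, 310.7, 250.2, 180.9, 75.4, 60.1, 95.0])
x, lo, hi = normalize(weekly_usage)        # input vector x_1 ... x_k of Eq. (1)
print(x, denormalize(x, lo, hi))
```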

The second layer:

  • The second layer (fuzzy layer) is the RBF layer; the Gaussian function is applied as the membership function of the fuzzy system.

    The definition of membership value µ(x) for the Gaussian membership function is as follows:

    $$\mu \left( x \right) = e^{{\frac{{ - \left( {x_{i} - \mu_{ij} } \right)^{2} }}{{2\sigma_{ij}^{2} }}}}$$
    (3)

    The input and output values of the RBF layer are defined as follows:

    $$I_{p}^{{{\text{RBF}}L}} = O_{p}^{{{\text{Input}}L}}$$
    (4)
    $$O_{p}^{{{\text{RBF}}L}} :\varphi_{j} \left( t \right) = \mathop \prod \limits_{i = 1}^{k} e^{{\frac{{ - \left( {x_{i} - \mu_{ij} } \right)^{2} }}{{2\sigma_{ij}^{2} }}}} = e^{{ - \mathop \sum \limits_{i = 1}^{k} \frac{{\left( {x_{i} - \mu_{ij} } \right)^{2} }}{{2\sigma_{ij}^{2} }}}} \quad i = 1, \, 2, \ldots , \, k;\, \, j = 1,2, \ldots ,p$$
    (5)

    where \(\varphi_{j} \left( t \right)\) is the output of the jth RBF node, and \(\mu_{j}\) = \([\mu_{1j} ,\mu_{2j} , \ldots ,\mu_{kj} ]\) and \(\sigma_{j}\) = [\(\sigma_{1j} ,\sigma_{2j} , \ldots ,\sigma_{kj}\)] are the center and width vectors of the jth RBF node; a small sketch of this computation follows.
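
The following short Python sketch computes the RBF outputs \(\varphi_{j}(t)\) of Eq. (5) for one input vector; the array shapes and the random test values are illustrative assumptions, not taken from the article.

```python
import numpy as np

def rbf_layer(x, mu, sigma):
    """Fuzzy (RBF) layer output, Eq. (5): phi_j = exp(-sum_i (x_i - mu_ij)^2 / (2 sigma_ij^2))."""
    # x: (k,) input vector; mu, sigma: (k, p) centers and widths of the p RBF nodes.
    d = (x[:, None] - mu) ** 2 / (2.0 * sigma ** 2)   # shape (k, p)
    return np.exp(-d.sum(axis=0))                     # phi_1 ... phi_p

k, p = 10, 11                                          # assumed sizes for illustration
rng = np.random.default_rng(0)
phi = rbf_layer(rng.random(k), rng.random((k, p)), np.full((k, p), 0.5))
print(phi)
```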

The third layer:

  • The third layer is the normalized layer, and the number of nodes here is equal to the number of nodes in the second (RBF) layer:

    $$I_{p}^{{{\text{Normalized}}L}} = \varphi_{j} \left( t \right)\quad j = 1,2, \ldots ,p$$
    (6)
    $$O_{p}^{{{\text{Normalized}}L}} :\vartheta_{j} \left( t \right) = \frac{{\varphi_{j} \left( t \right)}}{{\mathop \sum \nolimits_{j = 1}^{p} \varphi_{j} \left( t \right)}} \quad j = 1,2, \ldots ,p$$
    (7)

The fourth layer:

  • The fourth layer is the output layer.

    $$I_{q}^{{{\text{Output}}L}} = \vartheta_{j} \left( t \right)$$
    (8)
    $$O_{q}^{{{\text{Output}}L}} :y_{i} = \mathop \sum \limits_{j = 1}^{p} w_{ji} \vartheta_{j} \left( t \right),\quad i = 1,2, \ldots ,q$$
    (9)

    where \(w_{ji}\) is the weight between the jth node (neuron) in the third layer (normalized layer) and the ith node (neuron) in the fourth layer (output layer). The number of neurons in this layer (the number of output values) depends on the operator and can be 1, 2, …. A sketch of the complete forward pass is given after Eq. (10).

    Finally, \(y_{i}\) is the output of the ith node in the fourth layer, which is calculated as in the equation below (Fig. 2):

    $$y_{i} = \frac{{\mathop \sum \nolimits_{j = 1}^{p} w_{ji} e^{{ - \mathop \sum \nolimits_{i = 1}^{k} \frac{{\left( {x_{i} - \mu_{ij} } \right)^{2} }}{{2\sigma_{ij}^{2} }}}} }}{{\mathop \sum \nolimits_{j = 1}^{p} e^{{ - \mathop \sum \nolimits_{i = 1}^{k} \frac{{\left( {x_{i} - \mu_{ij} } \right)^{2} }}{{2\sigma_{ij}^{2} }}}} }},\quad i = 1,2, \ldots ,q$$
    (10)
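
Putting Eqs. (5), (7), and (9) together, a forward pass through the four layers can be sketched as follows; the sizes and random parameter values are purely illustrative assumptions.

```python
import numpy as np

def fnn_forward(x, mu, sigma, w):
    """Forward pass through the four FNN layers, ending in Eq. (10).

    x:     (k,)   normalized inputs (layer 1)
    mu:    (k, p) membership centers, sigma: (k, p) widths (layer 2)
    w:     (p, q) weights between the normalized and output layers (layer 4)
    """
    phi = np.exp(-(((x[:, None] - mu) ** 2) / (2.0 * sigma ** 2)).sum(axis=0))  # Eq. (5)
    theta = phi / phi.sum()                                                     # Eq. (7)
    return theta @ w                                                            # Eqs. (9)-(10), outputs y_1..y_q

k, p, q = 10, 11, 8                      # assumed sizes for illustration
rng = np.random.default_rng(1)
y = fnn_forward(rng.random(k), rng.random((k, p)),
                np.full((k, p), 0.5), rng.standard_normal((p, q)))
print(y)
```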

Learning algorithm

The adaptive second-order algorithm

The optimization method plays an important role in the efficiency of the training process of neural networks, and it affects the ability of a neural network, which depends on the size and the architecture of the network (Hornik et al. 1989); thus, the limitations of training algorithms have a strong effect on the performance of neural networks (Reed and Marks 1999).

One of the most popular training algorithms is the Levenberg–Marquardt (LM) method (Hagan and Menhaj 1994). The algorithm is a combination of the Gauss–Newton method and gradient descent. Hagan and Menhaj (1994) tested the LM algorithm on several function-approximation problems and compared the results with the conjugate gradient algorithm and with variable-learning-rate backpropagation. The results show that LM is more efficient than backpropagation and the conjugate gradient method in medium- to large-scale problems.

In 2002, a Levenberg–Marquardt algorithm with adaptive momentum for training feedforward neural networks was proposed by Ampazis and Perantonis (2002). The algorithm was tested on learning tasks that are known for their difficulty, and the final results show that the proposed algorithm solves these tasks very successfully.

Following the computation procedure of Levenberg–Marquardt proposed by Ampazis and Perantonis (2002), the learning rule of the adaptive second-order algorithm (ASOA) is given by

$$\Theta \left( {t + 1} \right) = \Theta \left( t \right) + \left( {\Psi \left( t \right) + \lambda \left( t \right)I} \right)^{ - 1} \Omega \left( t \right)$$
(11)

Ψ(t) is the quasi-Hessian matrix in this formula, \(\Omega \left( t \right)\) is the gradient vector, I is the unit matrix, and \(\lambda \left( t \right)\) is the adaptive learning rate.

With regard to the RBF-FNN described in "The fuzzy neural network architecture" section, Θ(t) contains three types of variables: the centers (\(\mu_{ij}\)) and the widths (\(\sigma_{ij}\)) of the membership functions (which are defined between the input layer and the RBF layer) and the connection weights \(w_{ji}\) between the third (normalized) and fourth (output) layers. Therefore, Θ(t) is defined as (Han et al. 2016)

$${{{\Theta}}}\left( t \right) = [\mu^{1} (t), \ldots ,\mu^{Q} (t),\sigma^{1} (t), \ldots ,\sigma^{Q} (t),w^{1} (t), \ldots ,w^{Q} (t)]$$
(12)

The weight parameters, the center of membership function, and the width of membership function can be optimized concurrently by ASOA-FNN.

The \(\Psi \left( t \right)\) (quasi-Hessian matrix) is defined as the summation of submatrices:

$$\Psi \left( t \right) = \mathop \sum \limits_{q = 1}^{Q} \Psi_{q} \left( t \right)$$
(13)

where the related submatrices are

$$\Psi_{q} \left( t \right) = j_{q}^{T} \left( t \right)j_{q} \left( t \right)$$
(14)

The gradient vector \(\Omega \left( t \right)\), which is proposed by Han et al. (2016), is

$$\Omega \left( t \right) = \mathop \sum \limits_{q = 1}^{Q} \omega_{q} \left( t \right)$$
(15)

where the related subvectors are

$$\omega_{q} \left( t \right) = j_{q} \left( t \right)e_{q} \left( t \right)$$
(16)

where \(e_{q} \left( t \right)\) is the error of the qth output neuron, i.e., the difference between the output value and the expected value of the qth neuron:

$$e_{q} \left( t \right) = y_{q} \left( t \right) - \widehat{{g_{q} \left( t \right)}}$$
(17)

\(j_{q} \left( t \right)\) is accumulated as

$$j_{q} \left( t \right) = \left[ {\frac{{\partial e_{q} \left( t \right)}}{{\partial \mu ^{1} \left( t \right)}},\frac{{\partial e_{q} \left( t \right)}}{{\partial \mu ^{2} \left( t \right)}}, \ldots \frac{{\partial e_{q} \left( t \right)}}{{\partial \mu ^{Q} \left( t \right)}},\,\frac{{\partial e_{q} \left( t \right)}}{{\partial \sigma ^{1} \left( t \right)}},\frac{{\partial e_{q} \left( t \right)}}{{\partial \sigma ^{2} \left( t \right)}}, \ldots \frac{{\partial e_{q} \left( t \right)}}{{\partial \sigma ^{Q} \left( t \right)}},\,\frac{{\partial e_{q} \left( t \right)}}{{\partial w^{1} \left( t \right)}},\frac{{\partial e_{q} \left( t \right)}}{{\partial w^{2} \left( t \right)}}, \ldots \frac{{\partial e_{q} \left( t \right)}}{{\partial w^{Q} \left( t \right)}}} \right]$$
(18)

λ(t) is the adaptive learning rate and formalized as (Han et al. 2016)

$$\lambda (t) = \mu (t)\lambda (t - 1),\quad 0 < \lambda (t) < 1$$
(19)

Han et al. (2016) propose µ(t) as

$$\mu (t) = (\tau^{\min } \left( t \right) + \lambda \left( {t - 1} \right))/(\tau^{\max } \left( t \right) + 1),\quad 0 < \tau^{\min } \left( t \right) < \tau^{\max } \left( t \right)$$
(20)

where \(\tau^{{\min}} \left( t \right)\) and \(\tau^{{\max}} \left( t \right)\) are the minimum and maximum eigenvalues of the quasi-Hessian matrix \(\Psi \left( t \right)\) (Han et al. 2016).
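
A sketch of one ASOA parameter update is shown below. It assumes that the error vector of Eq. (17) and the Jacobian rows of Eq. (18) have already been computed for the current parameters; the helper name asoa_step and the toy sizes are ours, not from the cited work.

```python
import numpy as np

def asoa_step(theta, J, e, lam_prev):
    """One ASOA update following Eqs. (11)-(20) (sketch).

    theta:    (n,)   current parameter vector Theta(t) of Eq. (12)
    J:        (Q, n) rows j_q(t) of Eq. (18)
    e:        (Q,)   errors e_q(t) of Eq. (17)
    lam_prev: adaptive learning rate lambda(t-1)
    """
    psi = J.T @ J                          # quasi-Hessian, Eqs. (13)-(14)
    omega = J.T @ e                        # gradient vector, Eqs. (15)-(16)
    eigvals = np.linalg.eigvalsh(psi)
    tau_min, tau_max = eigvals.min(), eigvals.max()
    mu_t = (tau_min + lam_prev) / (tau_max + 1.0)                     # Eq. (20)
    lam = mu_t * lam_prev                                             # Eq. (19)
    theta_new = theta + np.linalg.solve(psi + lam * np.eye(theta.size), omega)  # Eq. (11)
    return theta_new, lam

# Toy usage with random Jacobian and error values (illustration only).
rng = np.random.default_rng(4)
theta = rng.standard_normal(12)
J = rng.standard_normal((8, 12))
e = rng.standard_normal(8)
theta, lam = asoa_step(theta, J, e, lam_prev=0.999)
```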

Improved ASOA-FNN

The adaptive learning rate plays an important role in the basic ASOA-FNN process: a small learning rate can improve the algorithm's local searching ability, while a larger adaptive learning rate can enhance the global search ability. A continuous decrease of λ(t) helps ASOA-FNN perform an effective global and local search and avoid local extrema. Therefore, we need to stabilize the decrease rate in order to maintain a productive learning rate. Thus, this article adopts a new learning rate, as shown in Formulae (21) and (22).

$$\tau_{{{\text{average}}}} = (\tau^{{\min}} \left( t \right) + \tau^{{\max}} \left( t \right))/2,\quad 0 < \tau^{{\min}} \left( t \right) < \tau^{{\max}} \left( t \right)$$
(21)

Here, \(\tau_{{{\text{average}}}}\) represents the average of the minimum and maximum eigenvalues of \(\Psi \left( t \right)\), and µ(t) is defined as

$$\mu \left( {\text{t}} \right) = \, (\tau_{{{\text{average}}}} )/(\tau_{{{\text{average}}}} + 1)$$
(22)
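
The difference between the original factor of Eq. (20) and the improved factor of Eqs. (21)-(22) can be illustrated with a few lines of Python; the eigenvalues used below are made-up illustration values.

```python
def improved_mu(tau_min, tau_max):
    """Improved adaptive factor mu(t), Eqs. (21)-(22)."""
    tau_avg = 0.5 * (tau_min + tau_max)      # Eq. (21)
    return tau_avg / (tau_avg + 1.0)         # Eq. (22)

def han_mu(tau_min, tau_max, lam_prev):
    """Original factor of Eq. (20) (Han et al. 2016), shown for comparison."""
    return (tau_min + lam_prev) / (tau_max + 1.0)

# Assumed eigenvalues of the quasi-Hessian, for illustration only.
print(improved_mu(0.2, 5.0), han_mu(0.2, 5.0, lam_prev=0.999))
```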

Differential evolution

The differential evolution (DE) algorithm is a type of genetic algorithm proposed by Storn and Price (2005). The values of the variables in the DE algorithm are represented by real numbers, and the search technique in DE is based on population evolution. DE randomly chooses the initial population (\(X^{0}\) = \([x_{1}^{0} ,x_{2}^{0} , \ldots ,x_{NP}^{0} ]\)), in which NP is the size of the pool. The pool is a set of candidate solutions of the problem variables; the variables of this model are the centers of the membership functions, the widths of the membership functions, and the connection weights between the third and fourth layers. After a series of operations (mutation, crossover, and selection), the pool of the jth generation evolves to \(x_{i}^{j}\) = \([x_{i,1}^{j} ,x_{i,2}^{j} , \ldots ,x_{i,NP}^{j} ]\).

The three important types of operation in DE are mutation, crossover, and selection.

Mutation operation

Preventing the evolution from becoming trapped in a local optimal solution is the most important role of mutation. A constant mutation operator has some disadvantages: with too high a mutation rate, the search becomes too random, which greatly decreases the search efficiency and lowers the accuracy of the global optimal solution.

Consequently, Hou et al. (2018) proposed the adaptive mutation rate as

$$F \, = F_{\min } + \frac{{g_{\max } - g}}{{g_{\max } }}*F_{\max },$$
(23)

where \(F_{{\min}}\) = 0.1 and \(F_{\max }\) = 0.9, g is the current generation, and \(g_{\max }\) is the maximum number of generations.
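
A small sketch of Eq. (23); the generation counter values below are illustrative.

```python
def adaptive_mutation_rate(g, g_max, f_min=0.1, f_max=0.9):
    """Adaptive mutation factor F of Eq. (23): large early in the run, shrinking toward f_min."""
    return f_min + (g_max - g) / g_max * f_max

print([round(adaptive_mutation_rate(g, 2000), 3) for g in (0, 1000, 2000)])
```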

Crossover operation

Maintaining the diversity of the population is the main goal of the crossover operation. To generate the trial vector in the crossover operation, two vectors are used: the mutant vector \(X_{m}\), which is calculated by the mutation operation, and a vector chosen randomly from the pool. The child (trial) vector inherits its parameters from these parents with the crossover constant probability (CR). For instance, when the crossover constant is equal to 0, the trial vector comes entirely from the parent chosen randomly from the pool; on the other hand, when the crossover constant is equal to 1, the trial vector inherits entirely from \(X_{m}\). A combined sketch of the mutation, crossover, and selection operations is given at the end of the "Selection operation" subsection.

The size of the crossover probability factor plays a critical role in the DE algorithm. On the one hand, a small crossover rate decreases the algorithm's local searching ability; on the other hand, a large crossover rate improves the diversity of the pool and the global convergence of the algorithm.

Hou et al. (2018) proposed an adaptive crossover rate, which is defined as

$${\text{CR }} = {\text{CR}}_{{\min}} + \frac{g}{{g_{{\max}} }}*\left( {{\text{CR}}_{{\max}} - {\text{CR}}_{{\min}} } \right)$$
(24)

They defined \({\text{CR}}_{{\max}}\) and \({\text{CR}}_{{\min}}\) as preset numbers, with \({\text{CR}}_{{\max}} = 0.9\) and \({\text{CR}}_{{\min}} = 0\).

Selection operation

The values of the objective function of the target vector and the trial vector are compared. If the trial vector has the lower objective function value, it replaces the target vector in the pool.
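
The three DE operations can be combined into one generation step as sketched below. The DE/rand/1 mutation and binomial crossover shown here are the standard variants and are assumed, not taken verbatim from Hou et al. (2018); the population size, dimensionality, and objective are toy values.

```python
import numpy as np

def de_generation(pop, fitness, objective, F, CR, rng):
    """One DE generation with mutation, binomial crossover, and greedy selection (sketch)."""
    NP, D = pop.shape
    new_pop, new_fit = pop.copy(), fitness.copy()
    for i in range(NP):
        # Mutation: combine three distinct random individuals with factor F (Eq. 23).
        a, b, c = rng.choice([j for j in range(NP) if j != i], size=3, replace=False)
        mutant = pop[a] + F * (pop[b] - pop[c])
        # Crossover: inherit each parameter from the mutant with probability CR (Eq. 24).
        mask = rng.random(D) < CR
        mask[rng.integers(D)] = True          # ensure at least one mutant parameter
        trial = np.where(mask, mutant, pop[i])
        # Selection: the trial replaces the target if it has the lower objective value.
        f_trial = objective(trial)
        if f_trial < fitness[i]:
            new_pop[i], new_fit[i] = trial, f_trial
    return new_pop, new_fit

rng = np.random.default_rng(2)
pop = rng.random((20, 5))                     # 20 candidate parameter vectors of length 5 (toy sizes)
sphere = lambda v: float(np.sum(v ** 2))      # stand-in for the objective of Eq. (25)
fit = np.array([sphere(v) for v in pop])
pop, fit = de_generation(pop, fit, sphere, F=0.5, CR=0.9, rng=rng)
```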

Backpropagation algorithm

The backpropagation (BP) algorithm is a widely popular learning algorithm for neural networks with more than one hidden layer because of its simplicity and effectiveness (Rumelhart et al. 1986). It is used to determine the gradient that is needed to calculate the weights between the hidden layers in the network. The learning rate has a crucial role in searching for the local optimal solution of the neural network parameters: a low learning rate results in slow convergence, and a high learning rate results in divergence. Duffner and Garcia (2007) proposed an online BP algorithm with an adaptive global learning rate, whose main idea for adapting the learning rate is the "bold driver" method (Battiti 1989). In this article, the online BP algorithm with an adaptive global learning rate proposed by Duffner and Garcia (2007) is applied.
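
A minimal sketch of the "bold driver" idea for the global learning rate; the increase and decrease factors below are assumed values, not those of Duffner and Garcia (2007).

```python
def bold_driver(lr, prev_error, curr_error, up=1.05, down=0.5):
    """'Bold driver' global learning-rate adaptation (sketch with assumed factors).

    Increase the rate slightly after an error decrease; cut it sharply after an increase.
    """
    return lr * up if curr_error < prev_error else lr * down

lr = 0.01
for prev_e, curr_e in [(1.0, 0.8), (0.8, 0.85), (0.85, 0.7)]:   # toy error sequence
    lr = bold_driver(lr, prev_e, curr_e)
print(lr)
```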

Differential evolution–Backpropagation algorithm

Hou et al. (2018) proposed a new FNN training algorithm to optimize the FNN parameters. In this algorithm, the improved DE algorithm and the BP algorithm are combined: the improved DE algorithm is used to find a suboptimal or globally optimal solution of the FNN parameters, and the BP algorithm is applied to refine the local optimal solution of the FNN (Han et al. 2016).

In this article, the improved DE algorithm (Hou et al. 2018) and the BP algorithm (Duffner and Garcia 2007) are combined, and the results of the improved ASOA, ASOA, DE, BP, and DE–BP algorithms and other prediction methods such as AR, ARMA, and the Mackey–Glass time series are compared.

The applied objective function in all the algorithms is the Sphere function:

$$f\left( x \right) = \mathop \sum \limits_{j = 1}^{N} \mathop \sum \limits_{i = 1}^{npw} \left( {{\text{expected}}_{{{\text{value}}}} - {\text{output}}_{{{\text{value}}}} } \right)_{i}^{2}$$
(25)

Results of the experiment

The experimental data of the project are from a well-known web hosting database. The monitoring time is from May 5, 2018 to February 20, 2019 (31 weeks), and every week the average storage usage for each server is collected. The 2432 servers that show complex behavior in using storage are taken as the sample for the experiment. Of the 2432 servers, the first 2200 servers (90% of all the servers) are used for the training process, and the remaining 232 servers (10% of the servers) are used for the testing process. We run the algorithms for node 11, and the number of nodes in the output layer is defined as 8 (Fig. 3).

Fig. 3
figure 3

The input value

The performance of each algorithm is assessed by the root mean square error (RMSE), the average percentage error (APE), and the running time duration. The RMSE and APE are presented in Eqs. (26) and (27):

$${\text{RMSE}} = \sqrt {\frac{1}{{N*{\text{npd}}}}\mathop \sum \limits_{j = 1}^{N} \mathop \sum \limits_{i = 1}^{{{\text{npd}}}} \left( {{\text{output}}_{{{\text{value}}}} - {\text{expected}}_{{{\text{value}}}} } \right)^{2} }$$
(26)
$${\text{APE}} = \mathop \sum \limits_{t = 1}^{N} \frac{{\left\| {e\left( t \right)} \right\|}}{{\left\| {y\left( t \right)} \right\|}}*100\%$$
(27)
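
Eqs. (26) and (27) can be computed as follows; the array shapes correspond to N servers by npd prediction steps, and the data below are random placeholders rather than the experimental values.

```python
import numpy as np

def rmse(output, expected):
    """Root mean square error over all servers and prediction steps, Eq. (26)."""
    return float(np.sqrt(np.mean((output - expected) ** 2)))

def ape(output, expected):
    """Average percentage error, Eq. (27)."""
    e = output - expected
    return float(np.sum(np.linalg.norm(e, axis=1) / np.linalg.norm(expected, axis=1)) * 100)

rng = np.random.default_rng(3)
expected = rng.random((232, 8)) + 0.1                  # 232 test servers, 8 outputs (toy values)
output = expected + 0.05 * rng.standard_normal((232, 8))
print(rmse(output, expected), ape(output, expected))
```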

The initial learning rate of ASOA is assumed to be 0.999. The number of learning iterations for each algorithm is 2000. In the DE–BP algorithm, 1000 iterations are set for each of the two algorithms. The details are presented in Table 1.

Table 1 The result of RMSE for fuzzy neural network algorithms

The predicted values from the improved ASOA-FNN are compared with those from ASOA-FNN (Han et al. 2016), DE-FNN (Hou et al. 2018), BP-FNN (Duffner and Garcia 2007), and DE–BP-FNN (Hou et al. 2018; Duffner and Garcia 2007). The results show that the predicted values obtained from the improved ASOA-FNN model are more accurate than those of the other algorithms.

The results of the experiments show that the improved ASOA-FNN has a smaller RMSE and APE than ASOA-FNN, DE-FNN, BP-FNN, and DE–BP-FNN, but the running times of the improved ASOA-FNN and ASOA-FNN are higher than those of the other algorithms. Another important result is that DE-FNN has a smaller RMSE than DE–BP-FNN: in this problem, the RMSE of the DE algorithm with 2000 iterations is less than the RMSE of the combined algorithm with 1000 DE iterations and 1000 BP iterations.

Table 2 compares the predicted values of other methods, such as the Mackey–Glass time series, autoregression (AR), and the autoregressive moving average (ARMA), with the largest RMSE of the improved ASOA-FNN, ASOA-FNN, DE-FNN, BP-FNN, and DE–BP-FNN.

Table 2 The results of other prediction methods and fuzzy neural network

The results show that the improved ASOA, ASOA, DE, and DE–BP algorithms have a smaller RMSE than the other prediction methods (AR, ARMA, and the Mackey–Glass time series), and that the Mackey–Glass time series has a smaller RMSE than BP-FNN.

Significance test

In this section, the predictive performance of the improved ASOA-FNN is compared with that of the other algorithms (ASOA, DE, BP, and DE–BP) by significance test methods. The main points of evaluating the predictive performance of a model are as follows (Raschka 2018):

  1. Estimating the generalization performance and the predictive performance of the improved and identified algorithm.

  2. Identifying the machine learning algorithm that is suitable for our model and has the best performance.

The Wilcoxon signed-rank test, the F test, and the Morgan–Granger–Newbold test (Diebold and Mariano 1995) are implemented at the 0.01, 0.05, and 0.1 significance levels in a one-tailed test to confirm the performance of the improved ASOA. These tests help us evaluate the predictive ability of two different learning algorithms, and the statistical significance of the difference between their results is considered.

The performance metric RMSE is selected to carry out the significance tests for comparing the forecasting performance of the improved ASOA, ASOA, DE, BP, and DE–BP. The comparison results are illustrated in Tables 3, 4, 5, and 6. The results show that the improved ASOA achieves statistical significance in contrast to ASOA, DE, BP, and DE–BP at the 0.01,Footnote 2 0.05,Footnote 3 and 0.1Footnote 4 levels. In Table 3, the F test between the improved ASOA and ASOA is not significant at sample sizes 5 and 10, but it becomes significant when the sample size increases to 15 and 20. The results of these significance tests illustrate that the predictive performance of the improved ASOA is better than that of the other algorithms.
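
As an example of how such a one-tailed comparison can be carried out, the Wilcoxon signed-rank test on paired RMSE samples could be run with SciPy; the sample values below are placeholders, not the paper's results.

```python
import numpy as np
from scipy import stats

# Paired RMSE samples from repeated runs of two algorithms (toy values for illustration).
rmse_improved = np.array([0.021, 0.019, 0.023, 0.020, 0.022])
rmse_asoa     = np.array([0.025, 0.024, 0.027, 0.023, 0.026])

# One-tailed Wilcoxon signed-rank test: is the improved ASOA's RMSE significantly smaller?
stat, p_value = stats.wilcoxon(rmse_improved, rmse_asoa, alternative="less")
print(p_value, p_value < 0.05)
```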

Table 3 Significance test between improved ASOA and ASOA
Table 4 Significance test between improved ASOA and DE
Table 5 Significance test between improved ASOA and DE–BP
Table 6 Significance test between improved ASOA and BP

Figure 4 shows the typical evolution of the RMSE during the training process for node 11 of the improved ASOA-FNN using the applied learning algorithms. As shown in the figure, the RMSE is high at the beginning of the learning process, but after some iterations it drops and finally converges to a minimum value.

Fig. 4
figure 4

Evolution of training RMSE in improved ASOA-FNN

Conclusion

Storage usage forecasting provides reliable data support for resource management and resource planning, and storage usage forecasting technology is an effective means of optimizing resources. Therefore, experts who work in the field of web hosting resources must first give attention to a forecasting method. On this basis, customer value can be determined and the customer portfolio can be optimized.

In this article, an RBF-FNN architecture has been applied together with several learning algorithms, namely ASOA, DE, BP, and DE–BP. The applied ASOA-FNN, which was proposed by Han et al. (2016), has been improved. Three significance tests have been implemented to confirm the performance of the improved ASOA against the other applied algorithms. The results show that the improved ASOA outperforms ASOA, DE, BP, and DE–BP according to the statistical significance tests.

Other classical prediction methods, such as the autoregressive model, the autoregressive moving average model, and the Mackey–Glass time series, have also been implemented, and their results have been compared with those of the algorithms applied to the FNN. The comparison of the classical prediction methods and the FNN demonstrates that the learning efficiency and performance of the improved ASOA-FNN are better than those of the others.

In future work, an intelligent data mining algorithm that can cluster the customers into different lifecycle types with respect to the predicted storage usage, as well as other server resources such as CPU usage, will be vital.