Introduction

Data is often described as the new oil, and internet data traffic is growing significantly each year [1, 2]. With the advent of state-of-the-art technologies for data transmission and processing in the last decade, the internet has witnessed an increase in the intensity and volume of internet activities globally [3]. User-generated datasets contain useful statistics and information that can be harnessed for learning, although this may be constrained by privacy issues [4, 5]. Internet activities generate data traffic of various kinds during both download and upload. Monitoring and analysing internet traffic is becoming more challenging by the day because of the sheer increase in the volume of internet data traffic and the large capacity of connection trunks [2].

Internet traffic measurement and management are vital to the operations of Internet Service Providers for predicting future demand [6], and traffic monitoring can be achieved using flow statistics tools. Internet traffic measurement is typically deployed by capturing and processing packets at a particular monitoring point using high-performance central servers and specialized tools such as FlowScan and CoralReef [7]. Internet traffic monitoring over a large network, e.g. a state-wide computer network, produces a huge volume of data which may be time-intensive to analyse, especially in cases of global worm and virus attacks. Hence, it is vital to ensure that an optimal methodology is deployed for traffic monitoring [8] and for generating flow statistics [2, 9, 10], using flow aggregation and packet sampling in place of continuous sampling. An innovative approach for analysing internet traffic flow using timestamp data, which generates a traffic analysis cookie, was developed in [11].

The study by Kim et al. [12] emphasized the importance of traffic classification for uniquely identifying types of data traffic that ought to be blocked to ensure network security, and for preventing malicious activities [13, 14] and programs [15, 16]. In that study, the WEKA application was used to evaluate the performance of the seven most commonly used machine learning algorithms for traffic classification. According to Claffy and Monk [17] and Kim et al. [12], there is no industry norm or standard format for comparing the performance of one network with another, nor is there a defined best traffic classification method. As such, for the success of the commercial internet, the only baseline against which organisations can calibrate the performance of their networks is past network performance data. This emphasises the need to monitor and log internet data traffic for comparative network performance analysis.

Apart from the analysis of internet traffic for network security reasons, internet data traffic carries a lot of useful information about the originating network. The daily volumetric variation of internet traffic creates usage patterns that can be used for predictive analysis, which helps network engineers prepare the network adequately for anticipated heavy traffic so as to ensure optimal quality of service [18,19,20]. The quality of packet traffic may also be impaired by packet losses [21,22,23,24]. The peak and off-peak internet usage periods can be determined from monitored networks, and such information is vital for planning. Likewise, the capacity of the network to meet rising traffic demand can be readily observed. This helps network managers respond proactively to likely future issues caused by network overloading [25], so that appropriate mitigation, control and possible network expansion can be deployed in a timely manner.

The study by Tokuyama et al. [26] proposed the use of the day of the week and the time as features for improving network prediction accuracy using a Recurrent Neural Network. In [27], a deep neural network was applied for predicting internet traffic by analysing aggregated traffic data logged over a one-year period. The feasibility of using non-linear time series analysis for internet traffic prediction was demonstrated in the extensive study by [28]. Extracting useful information from data traffic can take different forms such as time series models, regression analysis, machine learning, and so forth. In this study, data mining algorithms were deployed for a classification analysis of the internet data traffic of a smart-community compliant private university in Nigeria for a period of one year, from January to December 2017.

Studies on internet traffic over time have provided various methods for improving flow monitoring statistics and computation time [7], with the focus often on traffic analysis, trend monitoring, traffic classification for categorising traffic types, and the identification of threats and malicious traffic [6, 29]. Traffic volume prediction is another vital aspect of network monitoring, which is often analysed using time series (linear and non-linear), regression analysis, decomposition methods, hybrid methods, etc. [26, 30, 31]; this provides an opportunity for further related studies using alternative methods, tools and features. Traffic data can be tracked and analysed over specific time intervals, e.g. in minutes or hourly over a 24-h period. The focus of this study is on predictive analysis of internet traffic data using aggregated daily upload and download IP traffic over a year. Internet traffic data can be examined using tools such as neural networks, time series statistics, deep learning, etc. In this study, predictive data mining models are developed using interactive data pipeline workflows and visual programming on the KNIME and Orange platforms [32, 33]. This paper is a case study focused on identifying unique internet traffic data trends within a university environment, which provides an opportunity for enhancing the quality of daily service through traffic prediction. The study implements data mining analysis using the latest visual programming tools that do not demand rigorous coding. As such, it demonstrates an alternative to traditional, extensively code-based data mining methods that network engineers can readily implement for predicting daily internet traffic using a well-defined traffic status classification.

Data acquisition and methodology

Valuable information that can guide decision making and improve the efficiency and productivity of operational processes can be extracted from the historical datasets of systems and processes by applying data mining methodologies. Databases are rich sources of historical information, and as such, useful knowledge can be obtained by analysing the accumulated data [34, 35]. Data mining entails the use of computer applications to apply various learning algorithms that identify patterns within a dataset [36]. It is a broad field that encompasses computer science and statistics. In the study by Auld et al. [6], the use of supervised learning for classifying internet traffic data was demonstrated using a trained Bayesian Neural Network, and accuracies of 95% and 98%, respectively, were achieved for the cases considered. A Naïve Bayes classifier was applied in its basic form by [37] for classifying internet traffic, and an accuracy of 65% was achieved; sophisticated refinements were also proposed for improving the predictive accuracy. An untrained classifier was applied by McGregor et al. [38] for identifying classes of traffic with similar properties for clustering into unique groups [39]. In the study by Soule et al. [40], data flow analysis was carried out by classifying traffic into elephant flows and non-elephant flows for estimating the probability of flow membership.

In this study, the internet traffic data of Covenant University in Nigeria over a period of one year was evaluated and analysed using predictive data mining algorithms. The data was logged using Mikrotik Hotspot Manager and FreeRADIUS, Radius Manager Web application deployed on a Linux platform, as implemented by Adeyemi et al. [18] through the SmartCU cluster. The logged dataset contains the upload and download internet traffic data (both in gigabytes) from the 1st of January to the 19th of December 2017, when the school closed for the year. During data preparation, the actual day of the week (Monday to Sunday) was captured to allow the model to identify any hidden, unique data usage pattern for each day of the week within a specific week and month. Covenant University runs a stable academic calendar which is fixed for each year, so specific academic activities within the university are likely to be causal factors influencing internet traffic on each day. Hence, if such regular, daily-activity-driven internet usage patterns were identified, the acquired knowledge could be used to forecast the anticipated data usage for any specific day and date in the next academic year. This forecast information will help network engineers prepare adequately towards maintaining a high quality of service. To achieve this goal, an extensive methodology was deployed to process the dataset, comprising data cleaning, data sorting, extraction of descriptive statistics, data normalization and coding, and implementation of classification algorithms to train on and classify the data and to evaluate the performance of the algorithms.

From the year-long dataset, four quarters of the traffic distribution were identified as shown in Figs. 1 and 2. The quarters are bounded by the minimum, the lower quartile, the median, the upper quartile and the maximum values of each parameter. Based on these quartiles, the internet traffic for each day was classified into four categories as shown in Table 1.
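The quartile-based labelling can be illustrated with a minimal sketch. The class names (LDT, SDT, MDT, HDT) are taken from the ROC figure captions later in the paper; the mapping of each quartile band to a label, the inclusive quartile method, and the sample volumes are assumptions made for illustration only, since Table 1 is not reproduced here.

```python
from statistics import quantiles

# Labels from the paper's figure captions; the band-to-label order is assumed.
LABELS = ["LDT", "SDT", "MDT", "HDT"]  # low, slight, moderate, heavy


def quartile_cut_points(values):
    """Return (lower quartile, median, upper quartile) of the daily volumes."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q2, q3


def classify(value, cuts):
    """Label one day's traffic by counting how many cut points it exceeds."""
    band = sum(value > c for c in cuts)
    return LABELS[band]


# Hypothetical daily download volumes in GB (not the study's data).
daily_gb = [120.0, 340.5, 210.2, 95.4, 410.9, 288.1, 175.3, 305.7]
cuts = quartile_cut_points(daily_gb)
tsc = [classify(v, cuts) for v in daily_gb]
print(list(zip(daily_gb, tsc)))
```

Because the cut points are quartiles of the same data, each class receives roughly a quarter of the days, which keeps the four classes approximately balanced.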

Fig. 1

Box plot of the download internet traffic. The boxplot shows the variation in the daily download internet traffic for a year across four quartiles

Fig. 2

Box plot of the upload internet traffic. The boxplot shows the variation in the daily upload internet traffic for a year across four quartiles

Table 1 Internet data traffic classification

The model

Six attributes were analysed in the data mining model for predicting the IP download traffic: month, week (week 1 to week 51), the day of the week (Monday to Sunday), the daily download traffic for the previous day, the average daily download traffic for the two previous days, and the Traffic Status Classification (TSC) of the download internet traffic data, which serves as the class to be predicted. Likewise, for the IP upload traffic, the following attributes were considered: month, week, the day of the week, the daily upload traffic for the previous day, the average daily upload traffic for the two previous days, and the TSC of the upload internet traffic data. The data mining analysis was performed using four learning algorithms on the KNIME data mining application (the Tree Ensemble, Decision Tree, Random Forest, and Naïve Bayes learner and predictor nodes) and five on the Orange data mining platform (K-nearest neighbour (kNN), Random Forest, Neural Network, Naïve Bayes and CN2 Rule Inducer). The KNIME and Orange data mining platforms were combined in this study for an extensive analysis and to identify significant variations in results between the two platforms, if any.
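As a rough sketch of how the five predictor attributes could be assembled from a list of daily volumes, consider the following. The function name `build_rows`, the week index counted from the first logged day, and the sample volumes are hypothetical; the study itself prepared the data in a spreadsheet for KNIME and Orange rather than in code.

```python
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def build_rows(start, volumes):
    """Build one attribute row per day from daily volumes (GB).

    The first two days are skipped because they have no two-day history.
    """
    rows = []
    for n in range(2, len(volumes)):
        day = start + timedelta(days=n)
        rows.append({
            "month": day.month,
            # Assumed convention: weeks counted from the first logged day.
            "week": n // 7 + 1,
            "day_of_week": DAY_NAMES[day.weekday()],
            "prev_day": volumes[n - 1],
            "prev_two_day_mean": (volumes[n - 1] + volumes[n - 2]) / 2,
        })
    return rows


# Hypothetical volumes beginning on 1 January 2017.
rows = build_rows(date(2017, 1, 1), [100.0, 140.0, 180.0, 160.0])
for row in rows:
    print(row)
```

With this week convention, 353 logged days span weeks 1 to 51, which agrees with the week range stated above.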

For the whole year, internet traffic data samples were captured and analysed for 353 days. 70% of the data samples were used for training the learning algorithm while the remaining 30% was applied for evaluating the performance of the trained model. The dataset was imported into the model using the Excel Reader. The numeric parameters were normalized to prevent size-based bias at the learning stage. The processed dataset was applied to the configured learner algorithms and the model results were exported for evaluation. The KNIME-based model showing the data workflow is available in Appendix as Fig. 18.
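The normalization and partitioning steps can be sketched as follows. Min-max scaling is assumed here because the paper does not state which normalization was applied, and the helper names and the random seed are hypothetical; this is not the KNIME node configuration used in the study.

```python
import random


def min_max_normalize(values):
    """Rescale values to [0, 1] so large-valued attributes do not dominate."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


def split_70_30(n_samples, seed=0):
    """Shuffle sample indices and split them 70% / 30%."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = n_samples * 7 // 10  # integer arithmetic avoids float rounding
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_70_30(353)
print(len(train_idx), len(test_idx))
print(min_max_normalize([2.0, 4.0, 6.0]))
```

Under this rounding, the 353 daily samples divide into 247 training and 106 evaluation samples.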

Based on the confusion matrix generated by each predictive data mining algorithm, model performance measures such as the accuracy, the F-measure, etc. can be determined using Eqs. (1) to (6) [41,42,43]. Correctly predicted positive samples are referred to as True Positives (TP), incorrect positive predictions as False Positives (FP), correctly predicted negative samples as True Negatives (TN), and incorrect negative predictions as False Negatives (FN). The accuracy of the machine learning algorithm, expressed in Eq. (1), is the proportion of correct predictions made by the model with respect to the total number of predictions.

$${\text{Accuracy}} = \frac{{\left( {TP + TN} \right)}}{{\left( {TP + FP} \right) + (TN + FN)}}$$
(1)

A dataset is said to be unbalanced when the number of instances is significantly unequal among the classes or when a particular class is not observed at all. The imbalance ratio varies from dataset to dataset, and it may create a bias towards the majority class. Accuracy is an inadequate performance measure for unbalanced datasets; for such cases, the balanced accuracy defined in Eq. (2) is more suitable.

$${\text{Balanced Accuracy}} = \frac{1}{2} \times \left( {\frac{TP}{TP + FN} + \frac{TN}{TN + FP}} \right)$$
(2)

For each class, the precision is the proportion of correctly classified samples out of the total samples classified into that particular class. It is mathematically defined in Eq. (3).

$${\text{Precision}} = \frac{TP}{TP + FP}$$
(3)

For each class, the recall is the proportion of correctly classified samples out of the total samples that truly belong to that particular class. It is mathematically defined in Eq. (4).

$${\text{Recall}} = \frac{TP}{TP + FN}$$
(4)

The F-measure or F-score is the harmonic mean of the recall and the precision as defined in Eq. (5).

$${\text{F-measure}} = \left( {\frac{{{\text{recall}}^{ - 1} + {\text{precision}}^{ - 1} }}{2}} \right)^{ - 1}$$
(5)

The error rate of the machine learning algorithm is defined by Eq. (6).

$${\text{Error}} = \frac{{\left( {FP + FN} \right)}}{{\left( {TP + FP} \right) + (TN + FN)}}$$
(6)
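Eqs. (1) to (6) can be evaluated directly from a confusion matrix. The sketch below treats one class at a time as the positive class (a one-vs-rest reading, which is an assumption for the four-class TSC problem) and uses a hypothetical 3 × 3 matrix, not any of the study's tables.

```python
def one_vs_rest_counts(matrix, k):
    """Return (TP, FP, TN, FN) for class k; rows are actual, columns predicted."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp
    fp = sum(row[k] for row in matrix) - tp
    tn = total - tp - fn - fp
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Evaluate Eqs. (1)-(6); assumes tp > 0 so no denominator is zero."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)                                  # Eq. (3)
    recall = tp / (tp + fn)                                     # Eq. (4)
    return {
        "accuracy": (tp + tn) / total,                          # Eq. (1)
        "balanced_accuracy": 0.5 * (recall + tn / (tn + fp)),   # Eq. (2)
        "precision": precision,
        "recall": recall,
        "f_measure": 2 / (1 / recall + 1 / precision),          # Eq. (5)
        "error": (fp + fn) / total,                             # Eq. (6)
    }


# Hypothetical confusion matrix for three classes.
cm = [[8, 2, 0],
      [1, 6, 3],
      [1, 2, 7]]
counts = one_vs_rest_counts(cm, 0)
result = metrics(*counts)
print(counts, result)
```

Note that the accuracy and the error rate always sum to one, since Eqs. (1) and (6) share the same denominator.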

The traffic status classification of the aggregated IP traffic flow Q(n) for day(n) in the university under study is mapped using data mining classification as a function of knowledge acquired from five key variables (day, week, month, the traffic for the previous day, and the average daily traffic for the two previous days) as expressed in Eqs. (7) and (8) for the daily upload and download internet traffic respectively, where Q(n − 1) → Q(n) → Q(n + 1) implies daily traffic variation.

$$TSC\;Q_{u} (n) = \left[ {day(n),week(n),month(n),Q_{u} (n - 1),\frac{{Q_{u} (n - 1) + Q_{u} (n - 2)}}{2}} \right]$$
(7)
$$TSC\;Q_{d} (n) = \left[ {day(n),week(n),month(n),Q_{d} (n - 1),\frac{{Q_{d} (n - 1) + Q_{d} (n - 2)}}{2}} \right]$$
(8)

Descriptive statistics of the dataset

The statistical properties of the dataset are summarized in this section. Table 2 presents the descriptive statistics of the internet traffic data, while Table 3 presents the parameters of the Logistic Distribution model which was used to fit the internet download traffic data. Table 4 shows the Logistic Distribution model parameters for fitting the internet upload traffic data. The internet traffic variations across the 51 weeks are presented in Fig. 3 for the download traffic and in Fig. 4 for the upload traffic. The average weekly internet traffic size for the download and upload IP traffic is presented in Fig. 5. Figures 6 and 7 show the probability density plot and the cumulative probability plot of the internet download traffic data, while Figs. 8 and 9 show the probability density plot and the cumulative probability plot of the internet upload traffic data.
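For readers unfamiliar with the Logistic Distribution, the sketch below estimates its location μ and scale s from sample moments (the variance of a logistic distribution is s²π²/3) and evaluates the density and cumulative probability. The moment-based estimate is a stand-in chosen for simplicity; the fitting procedure behind Tables 3 and 4 is not stated in the paper, and the sample values are hypothetical.

```python
import math
from statistics import mean, pstdev


def fit_logistic_moments(values):
    """Moment-based estimate of the logistic location (mu) and scale (s)."""
    mu = mean(values)
    s = math.sqrt(3) * pstdev(values) / math.pi
    return mu, s


def logistic_pdf(x, mu, s):
    """Probability density of the logistic distribution at x."""
    z = math.exp(-(x - mu) / s)
    return z / (s * (1 + z) ** 2)


def logistic_cdf(x, mu, s):
    """Cumulative probability of the logistic distribution at x."""
    return 1 / (1 + math.exp(-(x - mu) / s))


# Hypothetical daily volumes in GB.
sample = [100.0, 200.0, 300.0]
mu, s = fit_logistic_moments(sample)
print(mu, s, logistic_pdf(mu, mu, s), logistic_cdf(mu, mu, s))
```

The density peaks at μ with height 1/(4s), and half of the probability mass lies below μ.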

Table 2 Descriptive statistics of the internet traffic data for the year 2017
Table 3 Logistic distribution fitting model parameters for the internet download traffic
Table 4 Logistic distribution fitting model parameters for the internet upload traffic
Fig. 3

Internet download traffic variations across the 51 weeks

Fig. 4

Internet upload traffic variations across the 51 weeks

Fig. 5

Average weekly internet traffic size for the download and upload data

Fig. 6

Probability density plot of the internet download traffic

Fig. 7

Cumulative probability plot of the internet download traffic

Fig. 8

Probability density plot of the internet upload traffic

Fig. 9

Cumulative probability plot of the internet upload traffic

Results and discussion

The Decision Tree, the Tree Ensemble, the Random Forest, and the Naïve Bayes learners on the KNIME platform were trained using 70% of the dataset. On the Orange platform, the kNN, Neural Network, Random Forest, Naïve Bayes and CN2 Rule Inducer data mining algorithms were trained using 70% random sampling with a stratified shuffle split, which ensures that the percentage of samples for each class is preserved in the training and testing data divisions. The results of the predictive model evaluation using the remaining 30% of the data are presented in this section. The predictive analysis was carried out in two parts, for the download and the upload traffic data, using the four predictive learners for each as presented in the following sections. The KNIME workflow implemented for the classification analysis is presented in the Appendix as Fig. 18.
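The idea behind a stratified shuffle split can be sketched in a few lines: samples are shuffled within each class and the same fraction of every class is sent to the training set. This is a simplified single-split illustration with hypothetical labels and a hypothetical function name, not Orange's own implementation.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split indices so each class keeps the same train/test proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)                      # shuffle within the class
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Hypothetical, perfectly balanced label list (20 days per class).
labels = ["LDT"] * 20 + ["SDT"] * 20 + ["MDT"] * 20 + ["HDT"] * 20
train_idx, test_idx = stratified_split(labels)
test_counts = {c: sum(labels[i] == c for i in test_idx) for c in set(labels)}
print(len(train_idx), len(test_idx), test_counts)
```

Each class contributes 14 training and 6 test samples here, so the class proportions in both partitions match those of the full label list.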

Results for the KNIME based model

A. Internet download traffic data

  1. i.

    The Tree Ensemble Algorithm

    The Tree Ensemble learner correctly predicted the Traffic Status Classification (TSC) for 62.264% of the test samples. The confusion matrix for the Tree Ensemble predictor is presented in Table 5.

    Table 5 Confusion matrix for the Tree Ensemble predictor
  2. ii.

    Decision Tree Algorithm

    The Decision Tree learner was able to accurately predict the Traffic Status Classification (TSC) for 55.66% of the test samples. The confusion matrix for the Decision Tree predictor is presented in Table 6.

    Table 6 Confusion matrix for the Decision Tree Predictor
  3. iii.

    Random Forest Algorithm

    The Random Forest learner was able to accurately predict 60.377% of the model evaluation test samples with a Cohen’s Kappa (k) value of 0.465. The confusion matrix for the Random Forest predictor is presented in Table 7.

    Table 7 Confusion matrix for the Random Forest Predictor
  4. iv.

    Naïve Bayes Algorithm

    The Naïve Bayes Algorithm is a probabilistic classifier which applies the Bayes theorem with naïve independence assumptions among the classified features. The Naïve Bayes Algorithm accurately predicted 59.434% of the total test samples with a Cohen’s Kappa value of 0.454. The confusion matrix for the Naïve Bayes predictor is presented in Table 8.

    Table 8 Confusion matrix for the Naïve Bayes Predictor

B. Internet upload traffic data

  1. i.

    The Tree Ensemble Algorithm

    Similar to the prediction for the internet download traffic, the Tree Ensemble Algorithm correctly predicted the Traffic Status Classification for 62.264% of the model evaluation test samples. The confusion matrix for the Tree Ensemble predictor is presented in Table 9. A comparison of Tables 5 and 9 shows that although the accuracy is the same for the internet upload and download traffic predictions, the items misclassified in the two cases are different.

    Table 9 Confusion matrix for the Tree Ensemble Predictor
  2. ii.

    Decision Tree Algorithm

    The Decision Tree learner for the upload IP traffic had a predictive accuracy of 55.66%. The confusion matrix for the Decision Tree predictor is presented in Table 10.

    Table 10 Confusion matrix for the Decision Tree Predictor
  3. iii.

    Random Forest Algorithm

    The Random Forest learner was able to accurately predict 63.208% of the model evaluation test samples with a Cohen’s Kappa (k) value of 0.51. The confusion matrix for the Random Forest predictor is presented in Table 11.

    Table 11 Confusion matrix for the Random Forest Predictor
  4. iv.

    Naïve Bayes Algorithm

    The Naïve Bayes Algorithm accurately predicted 62.264% of the test samples with a Cohen’s Kappa value of 0.497. The confusion matrix for the Naïve Bayes predictor is presented in Table 12.

    Table 12 Confusion matrix for the Naïve Bayes Predictor

A comparison of the performance of the KNIME-based Decision Tree, Tree Ensemble, Random Forest, and Naïve Bayes learners is summarized in Tables 13 and 14. The F-measure statistics are presented in Table 15.

Table 13 The confusion analysis for the four machine learning algorithms on KNIME platform
Table 14 Comparison of the performance of the four data mining algorithms on KNIME platform
Table 15 Comparison of the F-measure statistics
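The Cohen's Kappa values quoted above for the Random Forest and Naïve Bayes predictors measure agreement between predicted and actual classes after discounting the agreement expected by chance, κ = (p_o − p_e)/(1 − p_e). A minimal sketch of the computation from a confusion matrix follows; the 2 × 2 matrix is hypothetical and is not one of Tables 5 to 12.

```python
def cohens_kappa(matrix):
    """Cohen's kappa from a square confusion matrix (rows actual, columns predicted)."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Chance agreement: product of row and column marginals, summed over classes.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / total ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical confusion matrix: 70% observed agreement, 50% expected by chance.
kappa = cohens_kappa([[20, 5], [10, 15]])
print(kappa)
```

A value of 0 indicates agreement no better than chance and 1 indicates perfect agreement, which is why kappa is a useful companion to raw accuracy for multi-class problems.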

Results for the Orange data mining platform

Orange is an open source data mining and machine learning software for explorative data analysis using visual programming. According to the developers, Orange is a fruitful and fun way of deploying data mining interactively for fast qualitative data analysis. Five machine learning algorithms were applied on the Orange platform to explore the upload and download IP traffic data and these are: kNN, Random Forest, Neural Network, Naïve Bayes and CN2 Rule Inducer algorithm. The samples were randomly selected using stratified shuffle split and the result of the analysis is presented in the following sections using the average over classes. The performance of the algorithms is compared using the Classification Accuracy (CA), Area under ROC Curve (AUC), the Precision rate, the Recall, and the F1 score. The Orange workflow is presented in the Appendix section as Fig. 19.

Internet Download Traffic Data

Table 16 shows a comparative performance analysis for the five machine learning algorithms deployed on the Orange platform for analysing the download internet traffic data. For a visual appreciation of the variation in the performance of each of the machine learning algorithms on the Orange platform, the AUC is presented using the receiver operating characteristic (ROC) curve, a probability curve that plots the sensitivity (the true positive rate) on the y-axis against the false positive rate (1 − specificity) on the x-axis. The ROC curve for the heavy data traffic (HDT) class of the internet download IP traffic status classification is plotted in Fig. 10, while Fig. 11 shows the ROC curve for the moderate data traffic (MDT) class. Figures 12 and 13 present the ROC curves for the slight data traffic (SDT) and low data traffic (LDT) classes, respectively.
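To make the ROC construction concrete, the sketch below sweeps a decision threshold over the predicted scores for one target class, records the (false positive rate, true positive rate) pairs, and integrates the curve with the trapezoidal rule to obtain the AUC. The scores and labels are hypothetical, and Orange's own ROC widget may handle ties and averaging differently.

```python
def roc_points(scores, is_positive):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    pos = sum(is_positive)
    neg = len(is_positive) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, p in zip(scores, is_positive) if s >= threshold and p)
        fp = sum(1 for s, p in zip(scores, is_positive) if s >= threshold and not p)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores for the target class and whether each day truly belongs to it.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
is_target = [True, True, False, True, False, False]
points = roc_points(scores, is_target)
area = auc(points)
print(points, area)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation of the target class from the remaining classes.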

Table 16 Comparative evaluation of the performance of the data mining algorithms using Orange software
Fig. 10

ROC for the HDT IP download TSC

Fig. 11

ROC for the MDT IP download TSC

Fig. 12

ROC for the SDT IP download TSC

Fig. 13

ROC for the LDT IP download TSC

Internet upload traffic data

Table 17 shows a comparative performance analysis for the five data mining algorithms deployed on the Orange platform for the upload IP traffic. For the internet upload IP traffic, the ROC curve for the HDT class of the traffic status classification is plotted in Fig. 14, while Fig. 15 shows the ROC curve for the MDT class. Figures 16 and 17 present the ROC curves for the SDT and LDT classes, respectively.

Table 17 Comparative evaluation of the performance of the data mining algorithms using Orange software
Fig. 14

ROC for the HDT IP upload TSC

Fig. 15

ROC for the MDT IP upload TSC

Fig. 16

ROC for the SDT IP upload TSC

Fig. 17

ROC for the LDT IP upload TSC

Summary of the models’ predictive performance

In terms of predictive accuracy, the order of the KNIME-based models for the internet download traffic is Tree Ensemble > Random Forest > Naïve Bayes > Decision Tree, while for the internet upload traffic the order is Random Forest > Tree Ensemble = Naïve Bayes > Decision Tree. The Decision Tree predictor had the worst performance in both cases, which implies that the Decision Tree Algorithm may not be well suited to predicting internet data traffic from historical internet traffic data without modifications to the model. For the Orange data mining platform, in terms of the AUC, the order of performance for the download traffic is Naïve Bayes > Neural Network > Random Forest > kNN > CN2 Rule Inducer, while for the upload traffic the order is Naïve Bayes > Random Forest > Neural Network > kNN > CN2 Rule Inducer.
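As a quick arithmetic check on the KNIME rankings, the reported accuracies can be converted back into counts of correctly classified test samples. The sketch assumes an evaluation set of 106 samples (353 days less a 247-sample training partition), which the paper does not state explicitly; the dictionaries simply restate the accuracies reported in the results section.

```python
# Assumed test-set size: 30% of 353 days after a 247-sample training partition.
TEST_SIZE = 353 - 353 * 7 // 10

download = {"Tree Ensemble": 62.264, "Decision Tree": 55.66,
            "Random Forest": 60.377, "Naïve Bayes": 59.434}
upload = {"Tree Ensemble": 62.264, "Decision Tree": 55.66,
          "Random Forest": 63.208, "Naïve Bayes": 62.264}


def implied_correct(accuracy_pct, n=TEST_SIZE):
    """Number of correct predictions implied by a percentage accuracy."""
    return round(accuracy_pct / 100 * n)


def ranking(accuracies):
    """Model names sorted from highest to lowest accuracy."""
    return sorted(accuracies, key=accuracies.get, reverse=True)


for name, table in (("download", download), ("upload", upload)):
    counts = {model: implied_correct(acc) for model, acc in table.items()}
    print(name, ranking(table), counts)
```

Under this assumption, every reported accuracy corresponds to a whole number of correct predictions (for example, 66 of 106 for 62.264%), and the gap between the best and worst KNIME models is at most eight test samples.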

Conclusion

Internet data traffic monitoring and measurement are vital to the operations of Internet Service Providers, and this can be achieved using a flow-based traffic monitoring approach. The logged internet traffic data acquired through traffic monitoring contains useful information and knowledge which can be accessed via data analysis. In this study, the upload and download internet traffic data generated at Covenant University in Nigeria for the year 2017 was statistically analysed, and predictive KNIME- and Orange-based models were developed for forecasting the internet data traffic on a given day using the traffic data of the previous days. The Tree Ensemble, Decision Tree, Random Forest, and Naïve Bayes data mining algorithms were applied in the KNIME model, while the Naïve Bayes, Neural Network, Random Forest, kNN and CN2 Rule Inducer algorithms were applied on the Orange platform as supervised-learning data mining models for predictive analysis.

The algorithms were trained with 70% of the dataset samples, while the remaining 30% was used for model evaluation. The evaluation results show that, on KNIME, the Tree Ensemble predictor had the best accuracy and the Decision Tree predictor the lowest accuracy for the internet download prediction. For the internet upload traffic, the Random Forest predictor had the best accuracy, the Naïve Bayes and the Tree Ensemble predictors had the same accuracy, and the Decision Tree predictor once again had the lowest accuracy. The lowest accuracy recorded for all the cases considered is 55.66%, while the highest is 63.208%. This shows that a data mining approach using interactive, visual data pipeline workflows is reasonably accurate for predicting internet traffic trends in a smart university, but further studies will be required to improve the performance of the models.