Statistical and probabilistic analysis of interarrival and waiting times of Internet2 anomalies

Kokoszka, Piotr; Nguyen, Hieu; Wang, Haonan; Yang, Liuqing

doi:10.1007/s10260-019-00500-x

Original Paper
Published: 21 November 2019

Volume 29, pages 727–744, (2020)
Statistical Methods & Applications

Abstract

Motivated by the need to introduce design improvements to the Internet network to make it robust to high traffic volume anomalies, we analyze statistical properties of the time separation between arrivals of consecutive anomalies in the Internet2 network. Using several statistical techniques, we demonstrate that for all unidirectional links in Internet2, these interarrival times have distributions whose tail probabilities decay like a power law. These heavy-tailed distributions have varying tail indexes, which in some cases imply infinite variance. We establish that the interarrival times can be modeled as independent and identically distributed random variables, and propose a model for their distribution. These findings allow us to use the tools of renewal theory, which in turn allows us to estimate the distribution of the waiting time for the arrival of the next anomaly. We show that the waiting time is stochastically substantially longer than the time between the arrivals, and may in some cases have infinite expected value. All our findings are tabulated and displayed in the form of suitable graphs, including the relevant density estimates.

References

Adler R, Feldman R, Taqqu MS (1998) A practical guide to heavy tails: statistical techniques for analyzing heavy tailed distributions. Birkhauser, Boston
Bandara VW, Pezeshki A, Jayasumana AP (2014) A spatiotemporal model for internet traffic anomalies. IET Netw 3:41–53
Bhuyan MH, Bhattacharyya DK, Kalita JK (2014) Network anomaly detection: methods, systems and tools. IEEE Commun Surv Tutor 16:303–336
Chandolla V, Benerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(15):58
Good PI (2013) Permutation, parametric, and bootstrap tests of hypotheses. Springer, Berlin
Hall P (1990) Using the bootstrap to estimate mean squared error and select smoothing parameter in nonparametric problems. J Multivar Anal 32:177–203
Kallitsis M, Stoev S, Bhattacharya S, Michailidis G (2016) AMON: an open source architecture for online monitoring, statistical analysis and forensics of multi-gigabit streams. IEEE J Sel Areas Commun 34:1834–1848
Kulkarni VG (2017) Modeling and analysis of stochastic systems. Chapman and Hall, Atlanta
Leland WE, Taqqu MS, Willinger W, Wilson DV (1994) On the self-similar nature of ethernet traffic (extended version). IEEE/ACM Trans Netw 2:1–15
Liao H-J, Lin C-HR, Lin Y-C, Tung K-Y (2013) Intrusion detection system: a comprehensive review. J Netw Comput Appl 36:16–24
Park K, Willinger W (2000) Self-similar network traffic and performance evaluation. Wiley, Hoboken
Paschalidis IC, Smaragdakis G (2009) Spatio-temporal network anomaly detection by assessing deviations of empirical measures. IEEE/ACM Trans Netw 17:685–697
Peng L, Qi Y (2017) Inference for heavy-tailed data analysis: applications in insurance and finance. Academic Press, Cambridge
Resnick SI (1997) Heavy tail modeling and teletraffic data. Ann Stat 25:1805–1869
Resnick SI (2007) Heavy-tail phenomena: probabilistic and statistical modeling. Springer, Berlin
Shumway RH, Stoffer DS (2017) Time series analysis and its applications with R examples. Springer, Berlin
Tsai C-F, Hsu Y-F, Lin C-Y, Lin W-Y (2009) Intrusion detection by machine learning: a review. Expert Syst Appl 39:11994–12000
Vaughan J, Stoev S, Michailidis G (2013) Network-wide statistical modeling, prediction and monitoring of computer traffic. Technometrics 55:79–93
Xie M, Han S, Tian B, Parvin S (2011) Anomaly detection in wireless sensor networks: a survey. J Netw Comput Appl 34:1302–1325
Zarpelao BB, Miani RS, Kawakani CT, de Alvarenga SC (2017) A survey of intrusion detection in internet of things. J Netw Comput Appl 84:25–37

Acknowledgements

This research has been partially supported by NSF grants DMS 1737795, DMS 1923142 and CNS 1932413. We thank Professor Anura P. Jayasumana of CSU's Department of Electrical and Computer Engineering for sharing the Internet2 anomalies data.

Author information

Authors and Affiliations

Department of Statistics, Colorado State University, Fort Collins, CO, 80522, USA
Piotr Kokoszka, Hieu Nguyen & Haonan Wang
Electrical and Computer Engineering, Colorado State University, Fort Collins, CO, 80522, USA
Liuqing Yang

Corresponding author

Correspondence to Piotr Kokoszka.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Estimation of the mean time between the arrivals of the anomalies

As explained in Sect. 4, to reliably estimate the distribution of the waiting time, we need a good estimator of the mean interarrival time. In the context of our data, this is a delicate task because the interarrival times have heavy tails, which make the usual sample mean unreliable. We compared the performance, measured by the root mean squared error (RMSE), of the following 12 mean estimators:

Median,
Huber location estimator with varying truncation constants: $k = 5, 10, 20, 30$,
Sample mean,
Trimmed mean with varying trimming fractions: trim = 0.025, 0.05, 0.10, 0.15, 0.20,
The estimator $\hat{\tau }= \int _0^{\infty } (1 - \widehat{G}(t)) dt$, based on the formula $\tau = E(X) = \int _0^{\infty }P(X>t)dt$.
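Some of the estimators listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the Huber estimator is omitted, and the estimator $\widehat{G}$ is defined in Sect. 4, which is not reproduced here, so the empirical distribution function is used as a stand-in. With that stand-in, the integral of $1 - \widehat{G}(t)$ reduces exactly to the sample mean, so any advantage of $\hat{\tau }$ over the sample mean must come from the particular $\widehat{G}$ used in the paper.

```python
import statistics


def trimmed_mean(xs, trim):
    """Mean after dropping the int(trim * n) smallest and largest values."""
    if not 0.0 <= trim < 0.5:
        raise ValueError("trim must lie in [0, 0.5)")
    ys = sorted(xs)
    cut = int(trim * len(ys))
    kept = ys[cut:len(ys) - cut]
    return sum(kept) / len(kept)


def survival_integral_mean(xs):
    """Integral of 1 - G_n(t) over [0, inf) for nonnegative data.

    G_n is the empirical CDF (an assumption; the paper's G-hat may differ).
    Between consecutive order statistics the survival function is constant,
    so the integral is a finite sum of rectangle areas.
    """
    ys = sorted(xs)
    n = len(ys)
    total, prev = 0.0, 0.0
    for i, y in enumerate(ys):
        # On [prev, y) exactly n - i observations exceed t.
        total += (y - prev) * (n - i) / n
        prev = y
    return total


def summarize(xs):
    """Collect a few of the candidate location estimates in one dict."""
    return {
        "median": statistics.median(xs),
        "sample_mean": statistics.fmean(xs),
        "trimmed_0.10": trimmed_mean(xs, 0.10),
        "survival_integral": survival_integral_mean(xs),
    }
```

For heavy-tailed data the median and trimmed means are far less sensitive to a single extreme interarrival time than the sample mean, at the price of estimating a different quantity when the distribution is skewed.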

A crucial step is to generate observations from a distribution that resembles the distribution of the real interarrival times and whose mean (expected value) can be computed analytically. We can then compare an estimated mean with the true mean within a simulation study. Since the interarrival times consist of many small values and a smaller number of dominating large values, we use a mixture model: interarrival times come from a Weibull distribution with probability p and from a half-t distribution with probability $1-p$. The Weibull component is designed to model the occurrence of small values, while the half-t component is designed to model the tail behavior and allow for either finite or infinite variance. (The t distribution satisfies (3.1) with $\alpha $ equal to the degrees of freedom parameter $\nu $.) The density function from which we simulated observations is thus given by

$$\begin{aligned} f(x) = p f_{w}(x; k, \lambda ) + (1-p)f_{t}(x; \nu , \sigma ), \ \ \ x > 0, \end{aligned}$$

where

$$\begin{aligned} f_{w}(x;k, \lambda )&= \frac{k}{\lambda } \left( \frac{x}{\lambda }\right) ^{k-1}e^{-(x/\lambda )^k}, \\ f_{t}(x;\nu , \sigma )&= 2\frac{\Gamma ((\nu +1)/2)}{\Gamma (\nu /2)\sqrt{\nu \pi \sigma ^2}} \left[ 1 + \frac{1}{\nu }\frac{x^2}{\sigma ^2}\right] ^{-(\nu +1)/2}, \end{aligned}$$

and where $k, \lambda , \nu , \sigma > 0$ and $0 \le p \le 1$. For this model, the value of $\tau $ can be computed in closed form:

$$\begin{aligned} \tau = p\left[ \lambda \Gamma (1 + 1/k)\right] + (1-p)\left[ 2\sigma \sqrt{\frac{\nu }{\pi }} \frac{\Gamma ((\nu +1)/2)}{\Gamma (\nu /2)(\nu -1)}\right] , \ \ \ \nu > 1. \end{aligned}$$
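The closed-form mean can be evaluated directly with the gamma function. The following minimal sketch takes the mixing weight to be the probability p of the Weibull component in the mixture density; the function names and any parameter values passed to them are illustrative assumptions, not estimates from the paper.

```python
import math


def weibull_mean(k, lam):
    """Mean of a Weibull distribution with shape k and scale lam."""
    return lam * math.gamma(1.0 + 1.0 / k)


def half_t_mean(nu, sigma):
    """Mean of a half-t distribution with nu degrees of freedom and scale sigma.

    The mean is finite only for nu > 1; otherwise infinity is returned.
    """
    if nu <= 1.0:
        return math.inf
    return (2.0 * sigma * math.sqrt(nu / math.pi)
            * math.gamma((nu + 1.0) / 2.0)
            / (math.gamma(nu / 2.0) * (nu - 1.0)))


def mixture_mean(p, k, lam, nu, sigma):
    """Mean tau of the Weibull / half-t mixture with Weibull weight p."""
    if p == 1.0:
        # Pure Weibull: avoid 0 * inf when the half-t mean is infinite.
        return weibull_mean(k, lam)
    return p * weibull_mean(k, lam) + (1.0 - p) * half_t_mean(nu, sigma)
```

As a sanity check, for $\nu = 3$ and $\sigma = 1$ the half-t mean simplifies to $2\sqrt{3}/\pi $, and for $k = 1$ the Weibull component is exponential with mean $\lambda $.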

We estimated the mixture model by maximum goodness-of-fit estimation, minimizing the Kolmogorov–Smirnov distance, using the R package fitdistrplus. We then used the estimated model to generate $n=1000$ samples of synthetic interarrival times and computed the Monte Carlo RMSE of each mean estimator as

$$\begin{aligned} RMSE = \sqrt{\frac{1}{n} \sum _{r=1}^{n}(\hat{\tau }_{r} - \tau )^2}, \end{aligned}$$

where $\hat{\tau }_r$ is a mean estimator computed from the rth Monte Carlo sample, and $\tau $ is the mean of the estimated mixture model.
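The Monte Carlo RMSE computation described above can be sketched as follows. This is an illustration, not the authors' code: the size of each synthetic sample is not stated in this appendix, so `size` is an assumption, and the parameter tuple `(p, k, lam, nu, sigma)` stands for whatever values the fitted mixture model provides. The half-t draw uses the standard representation of a t variable as a normal divided by the square root of a scaled chi-square.

```python
import math
import random


def sample_mixture(rng, size, p, k, lam, nu, sigma):
    """Draw `size` values from the Weibull / half-t mixture (Weibull weight p)."""
    out = []
    for _ in range(size):
        if rng.random() < p:
            # random.weibullvariate takes (scale, shape).
            out.append(rng.weibullvariate(lam, k))
        else:
            # |t_nu| = |Z| / sqrt(chi2_nu / nu), with chi2_nu ~ Gamma(nu/2, scale 2).
            z = rng.gauss(0.0, 1.0)
            chi2 = rng.gammavariate(nu / 2.0, 2.0)
            out.append(sigma * abs(z) / math.sqrt(chi2 / nu))
    return out


def mixture_mean(p, k, lam, nu, sigma):
    """Closed-form mean of the mixture; requires nu > 1."""
    weibull = lam * math.gamma(1.0 + 1.0 / k)
    half_t = (2.0 * sigma * math.sqrt(nu / math.pi) * math.gamma((nu + 1.0) / 2.0)
              / (math.gamma(nu / 2.0) * (nu - 1.0)))
    return p * weibull + (1.0 - p) * half_t


def sample_mean(xs):
    return sum(xs) / len(xs)


def monte_carlo_rmse(estimator, params, n_rep=1000, size=500, seed=0):
    """RMSE of `estimator` around the true mixture mean over n_rep samples."""
    rng = random.Random(seed)
    tau = mixture_mean(*params)
    squared_error = 0.0
    for _ in range(n_rep):
        xs = sample_mixture(rng, size, *params)
        squared_error += (estimator(xs) - tau) ** 2
    return math.sqrt(squared_error / n_rep)
```

Passing different estimators (for instance a median or a trimmed mean) to `monte_carlo_rmse` with the same seed compares them on identical synthetic samples.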

We found that the mixture model fits the observed interarrival times well. The fits in all links are similar to those shown in Fig. 2. The Kolmogorov–Smirnov goodness-of-fit test also fails to reject, for all links, the null hypothesis of equal distribution between real and simulated data. In all 28 links, the estimates of $\nu $ are between 1.2 and 2.2, so the half-t distribution successfully captures the tail behavior inferred from the Hill plots. Since the $\nu $ estimates are all greater than 1, the means of the estimated mixture distributions exist.

Table 3 RMSEs of the sample mean and three best mean estimators for interarrival time; bold indicates the lowest RMSE among the 12 estimators we considered

The RMSEs for the sample mean, the most commonly used estimator, and the three best estimators are shown in Table 3. We see that the sample mean performs poorly. The estimator $\hat{\tau }$, which we used in Sect. 4, is most often the best estimator, and when it is not, its RMSE is very close to the lowest RMSE. This justifies its choice as the preferred mean estimator for the interarrival time.

Significance tests

We present here formal statistical significance tests that confirm the conclusions stated in Sect. 4. We first consider the testing problem:

$H_0$: The distributions of interarrival times are identical for the 28 links,
$H_A$: The distributions of interarrival times are not identical for the 28 links.

Since, as shown in Sect. 4, $f_B(x) = \tau ^{-1}(1 - G(x))$, this test also applies to the distributions of waiting times. If these distributions are equal, then their expected values are also equal. We therefore use a permutation test based on the usual F-statistic:

$$\begin{aligned} F = \frac{U}{V}, \ \ \ U:= \frac{\sum _{i = 1}^{28}(\overline{X}_{i.} - \overline{ X}_{..})^2 n_{i}}{28-1}, \ \ V:= \frac{\sum _{i =1}^{28}\sum _{j=1}^{n_i}(X_{ij} - \overline{ X}_{i.})^2}{N - 28}, \end{aligned} \tag{B.1}$$

where $\overline{X}_{i.}$ is the sample mean of interarrival times in link i, $\overline{X}_{..}$ is the sample mean of interarrival times across all links, $n_i$ is the number of observed interarrival times for link i, and N is the number of observed interarrival times in all 28 links.

The observed value of the test statistic is $F=5.52$. However, we cannot compare it to a tabulated critical value because the distribution of the interarrival times is not normal. We therefore estimate the null distribution using permutations; see, e.g., Good (2013). Under $H_0$, the interarrival times among the 28 links are iid random variables; hence, by randomly reassigning the N interarrival times to the 28 groups, such that the number of observations in each group is not changed, we produce a new pseudo dataset for which $H_0$ is true. We repeat this resampling 10,000 times and obtain the null distribution of the test statistic shown in Fig. 7. The observed value $F=5.52$ lies far to the right of the range of the test statistic under the null hypothesis. Formally, we approximate the p-value by the proportion of permuted samples with $F > 5.52$ and find that the p-value is less than 0.0001. As a result, we reject $H_0$.
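The permutation procedure described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the denominator is the usual one-way ANOVA within-group mean square, and `groups` stands for a list holding one list of interarrival times per link.

```python
import random


def f_statistic(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    m = len(groups)
    n_total = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total
    between = sum(len(g) * (mu - grand_mean) ** 2
                  for g, mu in zip(groups, means)) / (m - 1)
    within = sum((x - mu) ** 2
                 for g, mu in zip(groups, means) for x in g) / (n_total - m)
    return between / within


def permutation_f_test(groups, n_perm=10000, seed=0):
    """Return the observed F and the permutation p-value estimate.

    Observations are pooled and randomly reassigned to groups of the
    original sizes, which mimics the null hypothesis of identical
    distributions across groups.
    """
    rng = random.Random(seed)
    observed = f_statistic(groups)
    sizes = [len(g) for g in groups]
    pooled = [x for g in groups for x in g]
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        start, permuted = 0, []
        for size in sizes:
            permuted.append(pooled[start:start + size])
            start += size
        if f_statistic(permuted) > observed:
            exceed += 1
    return observed, exceed / n_perm
```

A returned p-value of 0 should be read as "smaller than `1 / n_perm`", which corresponds to reporting a p-value below 0.0001 when 10,000 permutations are used.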

We also performed the Anderson–Darling test with the R package kSamples. The standardized Anderson–Darling test statistic is 28.36, with a p-value less than 0.0001. Hence, we also reject $H_0$.

We conclude that the interarrival time distributions among the 28 links are not identical; hence the waiting time distributions among the 28 links are not identical either.

We also used three standard goodness-of-fit tests, implemented in the R package EnvStats, namely the Kolmogorov–Smirnov, Cramér–von Mises and Anderson–Darling tests, to check whether the distribution of the interarrival times is exponential. For all 28 links, and for each test, the null hypothesis of an exponential distribution is rejected at the 5 percent significance level. We conclude that the anomaly interarrival times do not have an exponential distribution.
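To illustrate the idea behind such a check, the sketch below tests exponentiality with a Kolmogorov–Smirnov statistic whose null distribution is approximated by simulation. This is not the EnvStats procedure used above; it is a swapped-in Monte Carlo (Lilliefors-type) variant, chosen because the exponential mean is estimated from the data, and all function names are hypothetical.

```python
import math
import random


def ks_exponential_statistic(xs):
    """KS distance between the empirical CDF and an exponential CDF
    whose mean is estimated by the sample mean."""
    n = len(xs)
    mean = sum(xs) / n
    ys = sorted(xs)
    distance = 0.0
    for i, y in enumerate(ys):
        cdf = 1.0 - math.exp(-y / mean)
        # Compare the fitted CDF with the empirical CDF just after and just before y.
        distance = max(distance, (i + 1) / n - cdf, cdf - i / n)
    return distance


def ks_exponential_test(xs, n_sim=1000, seed=0):
    """Return the KS statistic and a Monte Carlo p-value.

    The statistic is scale invariant once the mean is estimated, so the
    null distribution can be simulated from the unit-rate exponential.
    """
    rng = random.Random(seed)
    n = len(xs)
    observed = ks_exponential_statistic(xs)
    exceed = 0
    for _ in range(n_sim):
        simulated = [rng.expovariate(1.0) for _ in range(n)]
        if ks_exponential_statistic(simulated) >= observed:
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (exceed + 1) / (n_sim + 1)
```

Applied to power-law data, such as draws from a Pareto distribution, the test rejects exponentiality decisively, which is the qualitative behavior reported for the 28 links.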

About this article

Cite this article

Kokoszka, P., Nguyen, H., Wang, H. et al. Statistical and probabilistic analysis of interarrival and waiting times of Internet2 anomalies. Stat Methods Appl 29, 727–744 (2020). https://doi.org/10.1007/s10260-019-00500-x

Accepted: 15 November 2019
Published: 21 November 2019
Issue Date: December 2020
DOI: https://doi.org/10.1007/s10260-019-00500-x
