Abstract
Motivated by the need to introduce design improvements to the Internet to make it robust to high traffic volume anomalies, we analyze statistical properties of the time separation between arrivals of consecutive anomalies in the Internet2 network. Using several statistical techniques, we demonstrate that for all unidirectional links in Internet2, these interarrival times have distributions whose tail probabilities decay like a power law. These heavy-tailed distributions have varying tail indexes, which in some cases imply infinite variance. We establish that the interarrival times can be modeled as independent and identically distributed random variables, and propose a model for their distribution. These findings allow us to use the tools of renewal theory, which in turn allow us to estimate the distribution of the waiting time for the arrival of the next anomaly. We show that the waiting time is stochastically substantially longer than the time between the arrivals, and may in some cases have infinite expected value. All our findings are tabulated and displayed in the form of suitable graphs, including the relevant density estimates.
References
Adler R, Feldman R, Taqqu MS (1998) A practical guide to heavy tails: statistical techniques for analyzing heavy tailed distributions. Birkhäuser, Boston
Bandara VW, Pezeshki A, Jayasumana AP (2014) A spatiotemporal model for internet traffic anomalies. IET Netw 3:41–53
Bhuyan MH, Bhattacharyya DK, Kalita JK (2014) Network anomaly detection: methods, systems and tools. IEEE Commun Surv Tutor 16:303–336
Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(15):58
Good PI (2013) Permutation, parametric, and bootstrap tests of hypotheses. Springer, Berlin
Hall P (1990) Using the bootstrap to estimate mean squared error and select smoothing parameter in nonparametric problems. J Multivar Anal 32:177–203
Kallitsis M, Stoev S, Bhattacharya S, Michailidis G (2016) AMON: an open source architecture for online monitoring, statistical analysis and forensics of multigigabit streams. IEEE J Sel Areas Commun 34:1834–1848
Kulkarni VG (2017) Modeling and analysis of stochastic systems. Chapman and Hall, Atlanta
Leland WE, Taqqu MS, Willinger W, Wilson DV (1994) On the self-similar nature of Ethernet traffic (extended version). IEEE/ACM Trans Netw 2:1–15
Liao HJ, Lin CHR, Lin YC, Tung KY (2013) Intrusion detection system: a comprehensive review. J Netw Comput Appl 36:16–24
Park K, Willinger W (2000) Self-similar network traffic and performance evaluation. Wiley, Hoboken
Paschalidis IC, Smaragdakis G (2009) Spatiotemporal network anomaly detection by assessing deviations of empirical measures. IEEE/ACM Trans Netw 17:685–697
Peng L, Qi Y (2017) Inference for heavy-tailed data analysis: applications in insurance and finance. Academic Press, Cambridge
Resnick SI (1997) Heavy tail modeling and teletraffic data. Ann Stat 25:1805–1869
Resnick SI (2007) Heavy-tail phenomena: probabilistic and statistical modeling. Springer, Berlin
Shumway RH, Stoffer DS (2017) Time series analysis and its applications with R examples. Springer, Berlin
Tsai CF, Hsu YF, Lin CY, Lin WY (2009) Intrusion detection by machine learning: a review. Expert Syst Appl 39:11994–12000
Vaughan J, Stoev S, Michailidis G (2013) Network-wide statistical modeling, prediction and monitoring of computer traffic. Technometrics 55:79–93
Xie M, Han S, Tian B, Parvin S (2011) Anomaly detection in wireless sensor networks: a survey. J Netw Comput Appl 34:1302–1325
Zarpelao BB, Miani RS, Kawakani CT, de Alvarenga SC (2017) A survey of intrusion detection in internet of things. J Netw Comput Appl 84:25–37
Acknowledgements
This research has been partially supported by NSF grants DMS-1737795, DMS-1923142 and CNS-1932413. We thank Professor Anura P. Jayasumana of CSU’s Department of Electrical and Computer Engineering for sharing the Internet2 anomalies data.
Appendices
Estimation of the mean time between the arrivals of the anomalies
As explained in Sect. 4, to reliably estimate the distribution of the waiting time, we need a good estimator of the mean interarrival time. In the context of our data, this is a delicate task because the interarrival times have heavy tails, which will bias the usual sample mean. We compared the performance (root mean squared error, RMSE) of the following 12 mean estimators:

Median,

Huber location estimator with varying truncation constants: \(k = 5, 10, 20, 30\),

Sample mean,

Trimmed mean with varying trimming fractions: trim = 0.025, 0.05, 0.10, 0.15, 0.20,

The estimator \(\hat{\tau }= \int _0^{\infty } (1 - \widehat{G}(t)) dt\), based on the formula \(\tau = E(X) = \int _0^{\infty }P(X>t)dt\).
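A crucial step is to generate observations from a distribution that resembles the distribution of the real interarrival times and whose mean (expected value) can be computed analytically. We can then compare an estimated mean with the true mean within a simulation study. Since the interarrival times consist of many small values together with dominating large values that occur with lower frequency, we use a mixture model: interarrival times come from a Weibull distribution with probability p and from a half-t distribution with probability \(1-p\). The Weibull component is designed to model the occurrence of small values, while the half-t component is designed to model the tail behavior and allow for either finite or infinite variance. (The t distribution satisfies (3.1) with \(\alpha \) equal to the degrees of freedom parameter \(\nu \).) The density function from which we simulated observations is thus given by
\[ f(x) = p\, f_W(x) + (1-p)\, f_T(x), \quad x \ge 0, \]
where, in the standard parametrizations of the Weibull and half-t densities,
\[ f_W(x) = \frac{k}{\lambda }\left( \frac{x}{\lambda }\right) ^{k-1} \exp \left\{ -\left( \frac{x}{\lambda }\right) ^{k}\right\} , \qquad f_T(x) = \frac{2\,\Gamma \left( \frac{\nu +1}{2}\right) }{\Gamma \left( \frac{\nu }{2}\right) \sqrt{\nu \pi }\,\sigma } \left( 1 + \frac{x^2}{\nu \sigma ^2}\right) ^{-\frac{\nu +1}{2}}, \]
and where \(k, \lambda , \nu , \sigma > 0\) and \(0 \le p \le 1\). For this model the value of \(\tau \) can be computed; for \(\nu > 1\) it is equal to
\[ \tau = p\, \lambda \, \Gamma \left( 1 + \frac{1}{k}\right) + (1-p)\, \frac{2 \sigma \sqrt{\nu }\; \Gamma \left( \frac{\nu +1}{2}\right) }{\sqrt{\pi }\, (\nu - 1)\, \Gamma \left( \frac{\nu }{2}\right) }. \]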
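As a numerical check of the mixture model just described, the following Python sketch draws from a Weibull/half-t mixture and evaluates its mean analytically. It is only an illustration: it assumes the standard parametrizations (Weibull with shape \(k\) and scale \(\lambda \); half-t defined as \(\sigma |T|\) with \(T\) a Student t variable with \(\nu \) degrees of freedom), and the function names `mixture_mean` and `sample_mixture` are hypothetical, not taken from the authors' code.

```python
import math
import numpy as np


def mixture_mean(p, k, lam, nu, sigma):
    """Mean of the mixture p * Weibull(k, lam) + (1 - p) * half-t(nu, sigma).

    Assumes the standard parametrizations; the half-t mean exists only
    for nu > 1, so the mixture mean is infinite when p < 1 and nu <= 1.
    """
    weibull_mean = lam * math.gamma(1.0 + 1.0 / k)
    if p >= 1.0:
        return weibull_mean
    if nu <= 1.0:
        return math.inf
    half_t_mean = (
        2.0 * sigma * math.sqrt(nu) * math.gamma((nu + 1.0) / 2.0)
        / (math.sqrt(math.pi) * (nu - 1.0) * math.gamma(nu / 2.0))
    )
    return p * weibull_mean + (1.0 - p) * half_t_mean


def sample_mixture(n, p, k, lam, nu, sigma, rng):
    """Draw n values: Weibull(k, lam) with probability p, else sigma * |t_nu|."""
    from_weibull = rng.random(n) < p
    weibull = lam * rng.weibull(k, size=n)
    half_t = sigma * np.abs(rng.standard_t(nu, size=n))
    return np.where(from_weibull, weibull, half_t)
```

For values of \(\nu \) close to 1, as estimated for some links, the sample mean of simulated draws converges to `mixture_mean` only very slowly, which is the difficulty this appendix addresses.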
We estimated the mixture model with the maximum goodness-of-fit estimator, which minimizes the Kolmogorov–Smirnov distance, as implemented in the R package fitdistrplus. We then used the estimated model to generate \(n=1000\) samples of synthetic interarrival times and computed the Monte Carlo RMSE of each mean estimator as
\[ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum _{r=1}^{n} \left( \hat{\tau }_r - \tau \right) ^2}, \]
where \(\hat{\tau }_r\) is a mean estimator computed from the rth Monte Carlo sample, and \(\tau \) is the mean of the estimated mixture model.
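The Monte Carlo RMSE computation can be sketched in Python as follows. This is a minimal illustration rather than the authors' R code: `monte_carlo_rmse` and `trimmed_mean` are hypothetical helper names, the sampler is any user-supplied function that returns one synthetic sample, and the trimmed mean drops `floor(trim * n)` observations from each end, which is an assumption about the trimming convention.

```python
import numpy as np


def trimmed_mean(x, trim):
    """Mean after dropping floor(trim * n) observations from each end (trim < 0.5)."""
    x = np.sort(np.asarray(x, dtype=float))
    cut = int(np.floor(trim * x.size))
    return float(x[cut:x.size - cut].mean())


def monte_carlo_rmse(estimator, sampler, true_mean, n_rep, rng):
    """Root mean squared error of `estimator` over n_rep simulated samples.

    estimator : function mapping a 1-D array to a number (e.g. np.mean)
    sampler   : function taking a numpy Generator and returning one sample
    true_mean : the known mean of the simulating distribution
    """
    estimates = np.array([estimator(sampler(rng)) for _ in range(n_rep)])
    return float(np.sqrt(np.mean((estimates - true_mean) ** 2)))
```

A call such as `monte_carlo_rmse(np.mean, sampler, tau, 1000, rng)`, with `sampler` drawing from the fitted mixture and `tau` its analytic mean, would give one entry of an RMSE comparison; other estimators can be passed in the same way, for example `lambda x: trimmed_mean(x, 0.05)`.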
We found that the mixture model has good fit to the observed interarrival times. The fits in all links are similar to those shown in Fig. 2. The Kolmogorov–Smirnov goodness-of-fit test also fails to reject, for all links, the null hypothesis of equal distribution between real and simulated data. In all 28 links, estimates for \(\nu \) are between 1.2 and 2.2, so the half-t distribution successfully captures the tail behavior inferred from the Hill plots. Since the \(\nu \) estimates are all greater than 1, the means of the estimated mixture distributions exist.
The RMSEs for the sample mean, the most commonly used estimator, and the three best estimators are shown in Table 3. We see that the sample mean performs poorly. The estimator \(\hat{\tau }\), which we used in Sect. 4, is most often the best estimator, and when it is not, its RMSE is very close to the lowest RMSE. This justifies its choice as the preferred estimator of the mean interarrival time.
Significance tests
We present here formal statistical significance tests that confirm the conclusions stated in Sect. 4. We first consider the testing problem:
\(H_0\): The distributions of interarrival times are identical for the 28 links,

\(H_A\): The distributions of interarrival times are not identical for the 28 links.
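We use the one-way analysis of variance statistic
\[ F = \frac{\sum _{i=1}^{28} n_i \left( \overline{X}_{i.} - \overline{X}_{..}\right) ^2 / (28 - 1)}{\sum _{i=1}^{28} \sum _{j=1}^{n_i} \left( X_{ij} - \overline{X}_{i.}\right) ^2 / (N - 28)}, \]
where \(X_{ij}\) is the jth interarrival time in link i, \(\overline{X}_{i.}\) is the sample mean of interarrival times in link i, \(\overline{X}_{..}\) is the sample mean of interarrival times across all links, \(n_i\) is the number of observed interarrival times for link i, and N is the number of observed interarrival times in all 28 links.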
The observed value of the test statistic is \(F=5.52\). However, we cannot compare it to a tabulated critical value because the distribution of the interarrival times is not normal. We therefore estimate the null distribution using permutations, see e.g. Good (2013). Under \(H_0\), the interarrival times among the 28 links are iid random variables; hence, by randomly reassigning the N interarrival times to the 28 groups, such that the number of observations in each group is not changed, we produce a new pseudo dataset for which \(H_0\) is true. We resample in this way 10,000 times, and obtain the null distribution of the test statistic shown in Fig. 7. The observed value of \(F=5.52\) is far to the right of the range of the test statistic under the null hypothesis. Formally, we approximate the p-value with the proportion of samples with \(F > 5.52\), and see that the p-value is less than 0.0001. As a result, we reject \(H_0\).
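The permutation procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the statistic is taken to be the usual one-way ANOVA F statistic, links are encoded as integer labels 0, 1, ..., and the names `f_statistic` and `permutation_f_test` are hypothetical.

```python
import numpy as np


def f_statistic(values, labels, n_groups):
    """One-way ANOVA F statistic; labels are integers 0..n_groups-1."""
    counts = np.bincount(labels, minlength=n_groups)
    sums = np.bincount(labels, weights=values, minlength=n_groups)
    group_means = sums / counts
    grand_mean = values.mean()
    between = np.sum(counts * (group_means - grand_mean) ** 2)
    within = np.sum((values - group_means[labels]) ** 2)
    return (between / (n_groups - 1)) / (within / (values.size - n_groups))


def permutation_f_test(values, labels, n_perm, rng):
    """Permutation null distribution of F; group sizes are preserved."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    n_groups = int(labels.max()) + 1
    observed = f_statistic(values, labels, n_groups)
    null = np.empty(n_perm)
    for b in range(n_perm):
        # Shuffling the labels reassigns observations to groups at random
        # while keeping the number of observations per group unchanged.
        null[b] = f_statistic(values, rng.permutation(labels), n_groups)
    # Proportion of permuted statistics strictly above the observed one.
    p_value = float(np.mean(null > observed))
    return observed, p_value, null
```

With `n_perm = 10000`, a returned `p_value` of 0 corresponds to the statement that the p-value is below 0.0001, since no permuted statistic exceeded the observed one.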
We also performed the Anderson–Darling test with the R package kSamples. The standardized Anderson–Darling test statistic is 28.36, with a p-value below 0.0001. Hence, we again reject \(H_0\).
We conclude that the interarrival time distributions among the 28 links are not identical; hence the waiting time distributions among the 28 links are not identical either.
We also used three standard goodness-of-fit tests implemented in the R package EnvStats, namely the Kolmogorov–Smirnov test, the Cramér–von Mises test and the Anderson–Darling test, to check if the distribution of the interarrival times is exponential. For all 28 links, and for each test, the null hypothesis of an exponential distribution is rejected at the significance level of 5 percent. We conclude that the anomaly interarrival time does not have an exponential distribution.
Cite this article
Kokoszka, P., Nguyen, H., Wang, H. et al. Statistical and probabilistic analysis of interarrival and waiting times of Internet2 anomalies. Stat Methods Appl 29, 727–744 (2020). https://doi.org/10.1007/s10260-019-00500-x
Keywords
 Heavy-tailed distributions
 Interarrival times
 Internet anomalies
 Renewal theory