1 Introduction

Event time data exhibit periodic behaviour in many real-life applications, for example astrophysics (Cicuttin et al. 1998), bioinformatics (Kocak et al. 2013), object tracking (Li et al. 2010) and computer networks (Heard et al. 2014; Price-Williams et al. 2017). Periodic arrival times are often mixed with non-periodic events. Therefore, to model the generating process appropriately, the two event types must be correctly distinguished. This article proposes a statistical method for classifying periodic arrivals within a sequence of event times.

This work is motivated by important applications in computer network security. In particular, network flow (NetFlow) data are analysed: these data provide information about Internet Protocol (IP) connections between nodes in a computer network and have been successfully used to monitor network traffic (Hofstede et al. 2014). NetFlow records are routinely collected in bulk at internet routers, providing large databases of IP address connections. Commonly, a large proportion of the connections from a network host can be ascribed to legitimate, automated polling to various services. It is therefore an important step in the model-building process to correctly identify which connections are due to the presence of a human at the machine, and which are purely automated. Making this distinction is crucial for network monitoring and statistical intrusion detection: anomalies related to the presence of an intruder within the network will be significantly easier to detect when the polling connections are filtered out from the analysis. Realistic modelling strategies seek to treat the two components separately: Price-Williams and Heard (2020) show that a nonparametric Wold process with step function excitation is a suitable choice for modelling human events in computer network traffic data. That model can only be applied when periodic connections are not present: this article provides a statistical framework for filtering automated traffic when human events are mixed with polling connections.

A useful network-wide filtering approach for polling behaviour, based on Fisher’s g-test for periodicities, was proposed in Heard et al. (2014). For each pair of network nodes, the method looks for strong peaks in the periodogram of the event series of connections along that edge. The methodology is specifically developed in the context of computer network data, but it can be applied to any sequence of arrival times. A limitation of this approach is that all connections from an edge are deemed to be automated if the maximal periodicity for that edge is found to be significant, whereas activity on some network edges can contain a mixture of both automated and human activity. For example, connections to an email server are continuously refreshed with a fixed periodicity, but the user might also manually ask if new messages have been received. It is therefore potentially valuable to further understand which of the events on such edges are actually associated with the presence of a user. This article aims to complement the existing methodologies and provide a data filtering algorithm for network connection records, where each connection on an edge will be classified as periodic or non-periodic through a mixture probability model. Note that the aim of the paper is not to discern malicious automated activities, such as those generated by botnets, from human activities, but to provide a statistical technique for separating purely automated, polling activity, either malicious or legitimate, from non-periodic connections, which also include human activity.

The problem of periodicity detection in computer network traffic has been extensively studied in the computer science literature. Common approaches include spectral analysis (Barbosa et al. 2012; AsSadhan and Moura 2014; Heard et al. 2014; Price-Williams et al. 2017), often combined with thresholding methods (Bartlett et al. 2011; Huynh et al. 2016; Chen et al. 2016). Alternatives include modelling of inter-arrival times (Bilge et al. 2012; Qiao et al. 2012; Hubballi and Goyal 2013), where distributional assumptions are imposed and the behaviour is tested under the null hypothesis of no periodicities (He et al. 2009; McPherson and Ortega 2011). Finally, some authors identify signals of periodicity in the autocorrelation function (Gu et al. 2008; Qiao et al. 2013), using changepoint methods (Price-Williams et al. 2017) or summary statistics computed sequentially in time windows (Eslahi et al. 2015). Price-Williams et al. (2017) also use wrapped distributions for detecting automated subsequences of events, and their methodology is able to handle changes in the periodicity and in the model parameters, but human activity within a periodic subsequence is not captured. Most models proposed in the literature aim to classify an entire edge as purely periodic or non-periodic. The model proposed in this article further analyses the edges with dominant periodicities, with the objective of recovering the human connections, when present; each observation is separately classified as periodic or non-periodic. The models described in this article could make a direct contribution to real-world network analysis, providing an efficient method for separating human and polling connections on the same edge, allowing deployment of the existing methodologies (Price-Williams and Heard 2020) for analysis of the filtered events.

The remainder of the article is organised as follows: Sect. 2 summarises the use of Fisher’s g-test for identifying the dominant periodicity in event time data. Using that periodicity, Sect. 3 introduces two transformations of event times which will be used to classify individual events as periodic or non-periodic. Models for these two quantities are presented in Sects. 4 and 5, respectively. Applications on real and synthetic data are discussed in Sect. 6.

2 Fisher’s g-test for detecting periodicities in event time data

Let \(t_1<t_2<\cdots <t_N\) be a sequence of arrival times, and \(N(\cdot )\) be a counting process recording the number of events over time. In the computer network application, \(N(\cdot )\) counts connections over time from the client to the server, for any particular client and server pair. It is most practical to treat \(N(\cdot )\) as a discrete-time process, with connection counts aggregated within bins of fixed width \(\delta \). Thus, N(t) will denote the number of events after \(t\delta \) seconds. The increments of the process are the corresponding bin counts \(\mathrm dN(t)=N(t)-N(t-1)\).

After T time units of observation, the discrete Fourier transform for the zero-mean corrected process yields the periodogram

$$\begin{aligned} {\hat{S}} (f)=\frac{1}{T}\left| \sum _{t=1}^T \{\mathrm {d}N(t) - N(T)/T\} \mathrm{e}^{-2\pi \imath ft} \right| ^2. \end{aligned}$$

The fast Fourier transform (FFT) allows efficient computation of \({\hat{S}}(f_k)\) at the discrete-time Fourier frequencies \(f_k=k/T\), \(k=1, \ldots ,m\) where \(m=\lfloor T/2\rfloor \), in \({\mathcal {O}}(T\log T)\) operations. Peaks in the periodogram values might correspond to periodic signals in the sequence of arrival times. Fisher (1929) proposed an exact test for the null hypothesis of no periodicities using the g-statistic,

$$\begin{aligned} g({\hat{S}}) = \frac{{\max _{1\le k\le m} {\hat{S}}(f_k)}}{{\sum _{1\le j\le m} {\hat{S}}(f_j)}}. \end{aligned}$$
(1)

The test arises in the theory of harmonic time series analysis and is the uniformly most powerful symmetric invariant procedure (Anderson 1971) against the alternative hypothesis of a periodicity existing at a single Fourier frequency, for a null hypothesis of a white noise spectrum (for further details, see Percival and Walden 1993). Under such a null hypothesis, Fisher (1929) derived an exact p-value for a realised value g of \(g({{\hat{S}}})\), for which there also exists a convenient asymptotic approximation (Jenkins and Priestley 1957):

$$\begin{aligned} {\mathbb {P}}\{g({\hat{S}})>g\}&=\sum _{j=1}^{\min \left\{ \left\lfloor 1/g \right\rfloor , m \right\} } (-1)^{j-1}~{m\atopwithdelims ()j}~(1-jg)^{m-1}\nonumber \\&\approx 1-\{1-\exp (-mg)\}^m. \end{aligned}$$
(2)

If a sequence of arrival times is found to be periodic at a given significance level, the corresponding period is

$$\begin{aligned} p =\delta \left\{ \text {argmax}_{f_k:1\le k \le m} {\hat{S}}(f_k)\right\} ^{-1}. \end{aligned}$$
(3)
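
As a minimal sketch of this testing pipeline, the steps above (binning, periodogram, g-statistic (1), approximate p-value (2) and period (3)) can be combined as follows; the function name and binning convention are illustrative assumptions, not part of the original methodology.

```python
import numpy as np

def g_test(event_times, delta=1.0):
    """Fisher's g-test on binned event counts (a sketch of Sect. 2)."""
    event_times = np.asarray(event_times)
    T = int(event_times.max() // delta) + 1          # observation length in bins
    dN = np.bincount((event_times // delta).astype(int), minlength=T)
    dN = dN - dN.sum() / T                           # zero-mean correction
    S = np.abs(np.fft.fft(dN)) ** 2 / T              # periodogram via the FFT
    m = T // 2
    S = S[1:m + 1]                                   # Fourier frequencies k/T, k = 1, ..., m
    g = S.max() / S.sum()                            # g-statistic (1)
    p_value = 1.0 - (1.0 - np.exp(-m * g)) ** m      # asymptotic approximation (2)
    period = delta * T / (np.argmax(S) + 1)          # estimated period (3), in seconds
    return g, p_value, period
```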

In a Bayesian setting, methods to detect periodicities have been developed in astrophysics and astrostatistics (Jaynes 1987), and in biostatistics and bioinformatics (de Lichtenberg et al. 2005; Kocak et al. 2013). None of these methods is as scalable or as easy to interpret as the g-test; therefore, for the purposes of this work, the periodicities will be obtained using (1) and the corresponding p-value (2).

In computer network traffic, if the p-value is below a pre-specified small significance level, then the entire edge is deemed to be periodic. Otherwise, if an edge is found to be not significantly periodic, then it is assumed that the majority of the activity on that edge can be ascribed to non-periodic events, possibly related to the presence of a human at the machine. If an edge is classified as periodic using the g-test, it is also possible that the observed connections contain a mixture of both polling and human activity. The objective of this paper is to further refine the classification performance for such mixed-type edges, classifying not only the entire edge activity as periodic or non-periodic, but each observed event on the edge.

The performance of the g-test on mixtures of periodic and non-periodic event times can be investigated via simulation. A sequence of 1000 events repeating every \(p=10\) s is generated and mixed with events from a Poisson process over the same time frame, with different rates \(\lambda \). For each value of \(\lambda \), the simulation is repeated 100 times to estimate the expected p-value (2) from the g-test, and the results are reported in Fig. 1. For interpretability, the mean proportion of periodic events, which is monotonically decreasing in \(\lambda \), is plotted on the horizontal axis. It is clear from Fig. 1 that the expected p-value decreases when the proportion of periodic events increases, but the expected p-value remains small even when the proportion of automated events is low. For example, when only \(2\%\) of the arrival times are polling events, the expected p-value of the g-test is \(\approx 0.0001\).

Fig. 1 Expected p-value for the g-test against the percentage of periodic events

3 Circular statistics for classifying event times

Let \(t_1<t_2<\dots <t_N\) be a sequence of arrival times, and let \(\varvec{z}=(z_1, \ldots ,z_N)\) be a vector of binary indicator variables, such that \(z_i=1\) if the ith event was periodic, and \(z_i=0\) if it was non-periodic or human-generated. For each event time \(t_i\), the following circular transformation can be defined

$$\begin{aligned} x_i=\frac{2\pi }{p} (t_i \bmod {p}). \end{aligned}$$
(4)

This transformation is particularly suitable for sequences presenting fixed phase polling (Price-Williams et al. 2017): the event times are expected to occur every p seconds, with a zero-mean random error. Wrapping the sequence to \([0,2\pi )\) also makes the methodology robust to dropouts in the observations. If the events occur p seconds after the preceding arrival time, plus error, then the sequence exhibits fixed duration polling, and a more appropriate transformation might be:

$$\begin{aligned} x_i=\frac{2\pi }{p} \{(t_i-t_{i-1}) \bmod {p}\}, \end{aligned}$$

with \(x_1=0\). This article is mainly concerned with fixed phase polling, but the methodology could be adapted to the case of fixed duration polling.
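
Both transformations are straightforward to compute. A brief sketch, with illustrative function names, is given below; the input is the vector of raw arrival times and the output lies on \([0,2\pi )\).

```python
import numpy as np

def wrap_fixed_phase(t, p):
    """Fixed phase polling transformation (4): x_i = (2*pi/p) * (t_i mod p)."""
    return 2 * np.pi * np.mod(t, p) / p

def wrap_fixed_duration(t, p):
    """Fixed duration polling: wrap the inter-arrival times, with x_1 = 0."""
    t = np.asarray(t)
    x = np.zeros(len(t))
    x[1:] = 2 * np.pi * np.mod(np.diff(t), p) / p
    return x
```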

The aim of this article is to use the observed vector \(\varvec{x}=(x_1, \ldots ,x_N)\) to estimate \(\varvec{z}\). For the ith event, the first measurement, \(x_i\), will reveal whether it was synchronised with the polling signal detected on those arrival times.

In some applications, a second known periodic effect will be present, such as a daily or annual seasonality. Denote this second periodicity \(p^\prime \), where typically \(p\ll p^\prime \). A second circular transformation can then be defined:

$$\begin{aligned} y_i=\frac{2\pi }{p^\prime } (t_i \bmod {p^\prime }) . \end{aligned}$$
(5)

Within the computer network application, it can be assumed that \(p^\prime =86{,}400\,{\hbox {s}}\). This measurement shows the time of day (which is 86,400 s long when there are no clock changes) at which the event occurred, and can be compared against an inferred diurnal model corresponding to human activity. More generally, one could be interested in estimating the density of the non-periodic events over the entire observation period, which yields the generic transformation \({{\tilde{t}}}_i = 2\pi t_i/T\).

In the next section, a mixture probability model for \(\varvec{x}\) is proposed, which can be used to classify events purely on their synchronicity with the polling signal. Then, in Sect. 5 the model is extended to incorporate \(\varvec{y}\), to see how much extra discriminative information can be extracted from the time of day. Note that the measurements (4) and (5) have both been scaled to lie on the unit circle with domain \([0,2\pi )\). This consistency in scaling will be convenient for specifying the full probability model (22) for event times in Sect. 5, since this makes simultaneous use of both quantities.

4 A wrapped normal–uniform mixture model

If a sequence of arrival times is classified as periodic with period p (3), then a majority of the wrapped values \(\varvec{x}\) from (4) will be concentrated around a peak. A wrapped normal distribution \({\mathbb {W}}{\mathbb {N}}_{[0,2\pi )}(\mu ,\sigma ^2)\) is therefore proposed as a model for those events, where \(\sigma >0\) quantifies the variability of event times around the peak location \(\mu \in [0,2\pi )\). The density of \(\mathbb {WN}_{[0,2\pi )}(\mu ,\sigma ^2)\) is

$$\begin{aligned} \phi _{\mathrm {WN}}^{[0,2\pi )}(x;\mu ,\sigma ^2) = \sum _{k=-\infty }^\infty \phi (x+2\pi k;\mu ,\sigma ^2)\mathbb {1}_{[0,2\pi )}(x), \end{aligned}$$
(6)

where \(\phi (\cdot ;\mu ,\sigma ^2)\) and later \(\varPhi \{\cdot ;\mu ,\sigma ^2\}\) will represent, respectively, the density and distribution functions of the Gaussian distribution \({\mathbb {N}}(\mu ,\sigma ^2)\).

In practical applications, p will usually be relatively small; hence, it is reasonable to assume that the density of the non-periodic events is smooth and therefore locally well approximated by a uniform distribution on the unit circle. Together, these components imply a density for \(x_i\) conditional on the latent variable \(z_i\),

$$\begin{aligned} f(x_i\vert z_i) = \phi _{\mathrm {WN}}^{[0,2\pi )}(x_i;\mu ,\sigma ^2)^{z_i}(2\pi )^{z_i-1}\mathbb {1}_{[0,2\pi )}(x_i). \end{aligned}$$
(7)

Let \(\theta \in [0,1]\) be the unknown proportion of events which are generated automatically and periodically, such that \(\mathbb P(z_i=1)=\theta \). Finally, let

$$\begin{aligned} \varvec{\psi }=(\mu ,\sigma ^2,\theta ) \end{aligned}$$

be the three model parameters which have been introduced. Then, assuming the individual values of \(\varvec{x}\) are drawn independently of one another, the likelihood function of the three model parameters is

$$\begin{aligned} L(\varvec{\psi }\vert \varvec{x})= \prod _{i=1}^N \left\{ \theta \phi _{\mathrm {WN}}^{[0,2\pi )}(x_i;\mu ,\sigma ^2)+\frac{1-\theta }{2\pi }\right\} \mathbb {1}_{[0,2\pi )}(x_i). \end{aligned}$$
(8)

It is not analytically possible to optimise the likelihood in (8) directly; instead, an expectation–maximisation (EM) algorithm (Dempster et al. 1977), common for mixture models, is proposed in the next section.

4.1 An EM algorithm for parameter estimation

In order to develop an EM algorithm for estimating \(\varvec{\psi }\), it is necessary to introduce additional latent variables \(\varvec{\kappa }=(\kappa _1,\ldots ,\kappa _N)\) for the mixture components in the wrapped normal model (6). For \(1\le i\le N\), if \(z_i=0\), then let \(\kappa _i=0\) with probability 1; for \(z_i=1\) and \(k\in {\mathbb {Z}}\), let

$$\begin{aligned}&{\mathbb {P}}(\kappa _i=k\vert z_i=1,\mu ,\sigma ^2)\nonumber \\&\quad =\varPhi \{2\pi (k+1);\mu ,\sigma ^2\}-\varPhi \{2\pi k;\mu ,\sigma ^2\}. \end{aligned}$$
(9)

Further, let

$$\begin{aligned} x_i\vert z_i=1,\kappa _i=k,\mu ,\sigma ^2\sim \bar{\mathbb N}_{[0,2\pi )}(\mu -2\pi k,\sigma ^2), \end{aligned}$$

denoting a normal distribution with mean \(\mu -2\pi k\) and variance \(\sigma ^2\), truncated to \([0,2\pi )\). Then, the conditional density for \(x_i\) given \(z_i\) is again (7). The role of the latent variable \(\kappa _i\) is depicted in Fig. 2.

Fig. 2 Interpretation of the latent variable \(\kappa \). Suppose \(x^\star \sim \mathbb {N}(\mu ,\sigma ^2)\), \(x=x^\star \bmod 2\pi \) and \(\kappa =(x^\star -x)/(2\pi )\). Then, \(x\sim \mathbb {WN}_{[0,2\pi )}(\mu ,\sigma ^2)\) and \(\kappa =k\) with probability given by (9)

Using the latent assignments \(\varvec{z}\) and \(\varvec{\kappa }\), the revised likelihood function is

$$\begin{aligned} L(\varvec{\psi }\vert \varvec{x},\varvec{z},\varvec{\kappa }) \propto \prod _{i=1}^N\left( \frac{1-\theta }{2\pi }\right) ^{1-z_i}\left\{ \theta \phi (x_i+2\pi \kappa _i;\mu ,\sigma ^2)\right\} ^{z_i}. \end{aligned}$$
(10)

At iteration m of the EM algorithm, given an estimate \(\varvec{\psi }^{(m)}\) of \(\varvec{\psi }\), the \({\mathbb {E}}\)-step computes the Q-function

$$\begin{aligned} Q(\varvec{\psi }\vert \varvec{\psi }^{(m)})={\mathbb {E}}_{\varvec{z},\varvec{\kappa }\vert \varvec{x},\varvec{\psi }^{(m)}}\left\{ \log L\left( \varvec{\psi }\vert \varvec{x},\varvec{z},\varvec{\kappa }\right) \right\} , \end{aligned}$$
(11)

where the expectation is taken with respect to the conditional distribution of \(\varvec{z}\) and \(\varvec{\kappa }\), given \(\varvec{x}\) and \(\varvec{\psi }^{(m)}\). This amounts to evaluating the so-called responsibilities,

$$\begin{aligned} \zeta _{i(j,k)}&={\mathbb {E}}_{\varvec{z},\varvec{\kappa }\vert \varvec{x},\varvec{\psi }^{(m)}}[\mathbb {1}_{(j,k)}(z_i,\kappa _i)\vert x_i,\varvec{\psi }^{(m)}] \nonumber \\&={\mathbb {P}}(z_i=j,\kappa _i=k\vert x_i,\varvec{\psi }^{(m)}), \end{aligned}$$
(12)

since, using (10), the Q-function (11) then simplifies to

$$\begin{aligned}&\sum _{i=1}^N\bigg [\zeta _{i(0,0)}\log \left( \frac{1-\theta }{2\pi }\right) \nonumber \\&\quad + \sum _{k=-\infty }^\infty \zeta _{i(1,k)}\log \left\{ \theta \phi (x_i;\mu -2\pi k,\sigma ^2)\right\} \bigg ]. \end{aligned}$$
(13)

The responsibilities in (12) can be calculated using Bayes' theorem, giving

$$\begin{aligned} \zeta _{i(j,k)}\propto \{\theta _{(m)}\phi (x_i;\mu _{(m)}-2\pi k,\sigma ^2_{(m)})\}^j \left( \frac{1-\theta _{(m)}}{2\pi }\right) ^{1-j}, \end{aligned}$$
(14)

where the normalising constant is given by the sum \(\theta _{(m)}\sum _{k^{\prime }=-\infty }^\infty \phi (x_i;\mu _{(m)}-2\pi k^{\prime },\sigma _{(m)}^2)+(1-\theta _{(m)})/2\pi \). Finally, maximising (13) with respect to \(\varvec{\psi }\) as the \(\mathbb M\)-step gives:

$$\begin{aligned} {{\tilde{\mu }}}_{(m+1)}&= \frac{{\sum _{i=1}^N\sum _{k=-\infty }^\infty (x_i+2\pi k)\zeta _{i(1,k)}}}{{\sum _{i=1}^N\sum _{k=-\infty }^\infty \zeta _{i(1,k)}}},\nonumber \\ \mu _{(m+1)}&={{\tilde{\mu }}}_{(m+1)} \bmod 2\pi , \nonumber \\ \sigma ^2_{(m+1)}&= \frac{{\sum _{i=1}^N\sum _{k=-\infty }^\infty (x_i+2\pi k-{{\tilde{\mu }}}_{(m+1)})^2\zeta _{i(1,k)}}}{{\sum _{i=1}^N\sum _{k=-\infty }^\infty \zeta _{i(1,k)}}} , \nonumber \\ \theta _{(m+1)}&= \frac{1}{N}\sum _{i=1}^N\sum _{k=-\infty }^\infty \zeta _{i(1,k)}=1-\frac{1}{N}\sum _{i=1}^N \zeta _{i(0,0)}. \end{aligned}$$
(15)

In practical computations, the infinite sums must be truncated to a suitable level.
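
A minimal sketch of this EM scheme is given below, with the infinite sums truncated to \(k\in \{-K,\ldots ,K\}\); the function name, default truncation level and initialisation are illustrative assumptions (the reference implementation is available in the repository referenced in Sect. 8).

```python
import numpy as np
from scipy.stats import norm

def em_wrapped_mixture(x, n_iter=100, K=5, mu=np.pi, sigma2=1.0, theta=0.5):
    """EM for the uniform-wrapped normal mixture of Sect. 4.1 (a sketch)."""
    x = np.asarray(x)
    ks = np.arange(-K, K + 1)
    for _ in range(n_iter):
        # E-step: responsibilities (14); column k holds zeta_{i(1,k)}
        wn = norm.pdf(x[:, None], mu - 2 * np.pi * ks, np.sqrt(sigma2))
        num1 = theta * wn
        const = num1.sum(axis=1) + (1 - theta) / (2 * np.pi)
        zeta1 = num1 / const[:, None]
        w = zeta1.sum()                          # sum of the periodic responsibilities
        # M-step: closed-form updates (15) on the unwrapped values x_i + 2*pi*k
        xk = x[:, None] + 2 * np.pi * ks[None, :]
        mu_tilde = (xk * zeta1).sum() / w
        sigma2 = ((xk - mu_tilde) ** 2 * zeta1).sum() / w
        mu = mu_tilde % (2 * np.pi)
        theta = w / len(x)
    return mu, sigma2, theta
```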

4.2 A Bayesian formulation

Data augmentation (Higdon 1998) can be used to construct an analogue of the EM algorithm in a Bayesian setting, with a Gibbs sampler for the latent variables \(\varvec{z}\) and \(\varvec{\kappa }\). A convenient choice of prior distribution assumes a factorisation \(p(\varvec{\psi })=p(\mu ,\sigma ^2)p(\theta )\), where \(\theta \sim \mathrm {Beta}(\gamma _0,\delta _0)\), \((\mu ,\sigma ^2)\sim \mathrm {NIG}(\mu _0,\lambda _0,\alpha _0,\beta _0)\) and \(\mathrm {NIG}\) denotes the normal-inverse gamma distribution. The chosen prior distributions are conjugate for the likelihood and therefore keep the inferential process analytically tractable (see Bernardo and Smith 1994, for more details). The prior and posterior probabilities for the latent assignments, conditional on \(\varvec{\psi }\), are the same as (9) and (14), respectively. Conditional on \(\varvec{z}\), the posterior distribution for the mixing proportion is

$$\begin{aligned} \theta \vert \varvec{z} \sim \mathrm {Beta}(\gamma _0+N_1,\delta _0+N_0), \end{aligned}$$

where \(N_1=\sum _{i=1}^N z_i\) is the number of automated events and \(N_0=N-N_1\) is the number of human and non-periodic automated events. The conditional posterior distribution of \(\mu \) and \(\sigma ^2\) is \(\mathrm {NIG}(\mu _{N_1},\lambda _{N_1},\alpha _{N_1},\beta _{N_1})\), where

$$\begin{aligned} {{\tilde{x}}}&=\sum _{i: z_i=1} (x_i+2\pi \kappa _i)/N_1, \nonumber \\ \mu _{N_1}&= \frac{\lambda _0\mu _0+N_1{{\tilde{x}}}}{\lambda _0+N_1}, \end{aligned}$$
(16)
$$\begin{aligned} \lambda _{N_1}&= \lambda _0 + N_1 , \nonumber \\ \alpha _{N_1}&= \alpha _0 + {N_1}/{2}, \nonumber \\ \beta _{N_1}&= \beta _0 + \frac{1}{2}\left\{ \sum _{i:z_i=1}(x_i+2\pi \kappa _i-{{\tilde{x}}})^2 + \frac{\lambda _0N_1}{\lambda _{N_1}}{({{\tilde{x}}}-\mu _0)^2}\right\} . \end{aligned}$$
(17)

Similar to the case of (15), samples \(\mu ={{\tilde{\mu }}}\bmod 2\pi \) from the posterior for \((\mu ,\sigma ^2)\) should be used, where \({{\tilde{\mu }}}\) is sampled from \(\mathrm {NIG}(\mu _{N_1},\lambda _{N_1},\alpha _{N_1},\beta _{N_1})\).
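
For illustration, a single conditional draw of \(\varvec{\psi }\) within such a Gibbs sweep could be sketched as follows, assuming the latent vectors \(\varvec{z}\) and \(\varvec{\kappa }\) have already been sampled using (14); the function and argument names are hypothetical.

```python
import numpy as np

def gibbs_update_psi(x, z, kappa, mu0=np.pi, lam0=1.0, alpha0=1.0, beta0=1.0,
                     gamma0=1.0, delta0=1.0, rng=None):
    """One draw of (mu, sigma2, theta) from the conjugate Beta and
    NIG posteriors (16)-(17), given the latent assignments z and kappa."""
    rng = rng or np.random.default_rng()
    xs = x[z == 1] + 2 * np.pi * kappa[z == 1]   # unwrapped periodic events
    N1, N0 = len(xs), int(np.sum(z == 0))
    theta = rng.beta(gamma0 + N1, delta0 + N0)
    xbar = xs.mean() if N1 > 0 else mu0
    lamN = lam0 + N1
    muN = (lam0 * mu0 + N1 * xbar) / lamN
    alphaN = alpha0 + N1 / 2
    betaN = beta0 + 0.5 * (np.sum((xs - xbar) ** 2)
                           + lam0 * N1 * (xbar - mu0) ** 2 / lamN)
    sigma2 = 1.0 / rng.gamma(alphaN, 1.0 / betaN)    # sigma2 ~ InvGamma(alphaN, betaN)
    mu = rng.normal(muN, np.sqrt(sigma2 / lamN)) % (2 * np.pi)
    return mu, sigma2, theta
```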

5 Incorporating time of day

The model presented in Sect. 4 only made use of \(\varvec{x}\), the arrival times wrapped onto the unit circle according to the estimated periodicity p (4); recall that these values reveal the synchronicity of each event with the automated polling signal. However, further information might be obtained from \(\varvec{y}\) (5), the times of day at which each event occurred. In computer networks, this is a reasonable assumption, since human-generated events should be subject to some level of diurnality. This section introduces a model for the daily distribution of human connections to help extract this extra information. Following Heard and Turcotte (2014), a flexible model for the distribution of arrivals of human events through a typical day will be obtained by assuming the density to be a step function with \(\ell \ge 1\) segments, written

$$\begin{aligned} s(y;\ell ,\varvec{\tau },\varvec{h}) = \frac{\mathbb {1}_{[0,\tau _{1})\cup [\tau _{\ell },2\pi )}(y)~h_\ell }{2\pi -\tau _{\ell }+\tau _{1}}+\sum _{j=1}^{\ell -1} \frac{\mathbb {1}_{[\tau _{j},\tau _{j+1})}(y)~h_j}{\tau _{j+1}-\tau _{j}}. \end{aligned}$$
(18)

The segment probabilities \(\varvec{h}=(h_1,\ldots ,h_{\ell })\in [0,1]^\ell \) satisfy \(\sum _{j=1}^{\ell } h_j = 1\), and the circular changepoints \(\varvec{\tau }=(\tau _{1}, \ldots ,\tau _{\ell })\), \(0\le \tau _1< \ldots< \tau _\ell <2\pi \) determine the step positions.
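
As a minimal sketch, the density (18) can be evaluated as below, with the wrap-around segment \([0,\tau _1)\cup [\tau _\ell ,2\pi )\) mapped to the last entries of the width and probability vectors; the function name is illustrative.

```python
import numpy as np

def step_density(y, tau, h):
    """Evaluate the step function density (18) at daily times y, given sorted
    circular changepoints tau in [0, 2*pi) and segment probabilities h."""
    tau, h = np.asarray(tau), np.asarray(h)
    widths = np.append(np.diff(tau), 2 * np.pi - tau[-1] + tau[0])
    idx = np.searchsorted(tau, y, side='right') - 1  # index -1 selects wrap-around
    return h[idx] / widths[idx]
```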

The number of segments \(\ell \) is treated as unknown and assigned a geometric prior with parameter \(\nu \in (0,1)\) and mass function \(\nu (1-\nu )^{\ell -1}\). The natural prior for \(\varvec{h}\vert \varvec{\tau },\ell \) (Bernardo and Smith 1994) is

$$\begin{aligned}&\mathrm {Dirichlet}[\eta \varLambda \{(\tau _1,\tau _2)\}, \ldots ,\eta \varLambda \{(\tau _{\ell -1},\tau _\ell )\}, \nonumber \\&\quad \eta \varLambda \{(\tau _\ell ,2\pi )\cup (0,\tau _1)\}], \end{aligned}$$
(19)

where \(\eta >0\) is a concentration parameter and \(\varLambda \{\cdot \}\) is here taken to be the Lebesgue measure. The hierarchical specification of the model is completed with an uninformative prior on the segment locations: they are assumed to be the order statistics of \(\ell \) draws from the uniform distribution on \([0,2\pi )\).

Given \(\ell \) segments defined by changepoints \(\varvec{\tau }\), the Dirichlet probabilities \(\varvec{h}\) can be integrated out to yield the marginal likelihood of observing daily arrival times \(\varvec{y}\), which is given by

$$\begin{aligned}&\frac{c(N)\varGamma \{N^\prime _\ell +\eta (2\pi -\tau _\ell +\tau _1)\}}{\varGamma \{\eta (2\pi -\tau _\ell +\tau _1)\}(2\pi -\tau _\ell +\tau _1)^{N^\prime _\ell }} \nonumber \\&\quad \prod _{j=1}^{\ell -1}\frac{\varGamma \{N^\prime _j+\eta (\tau _{j+1}-\tau _j)\}}{\varGamma \{\eta (\tau _{j+1}-\tau _j)\}(\tau _{j+1}-\tau _{j})^{N^\prime _j}}, \end{aligned}$$
(20)

where \(N^\prime _j=\sum _{i=1}^N \mathbb {1}_{[\tau _j,\tau _{j+1})}(y_i)\) is the number of observations in the jth segment, \(1\le j \le \ell -1\), \(N^\prime _\ell =N-\sum _{j=1}^{\ell -1} N^\prime _j\) and \(c(N)=\varGamma (2\pi \eta )/\varGamma (2\pi \eta +N)\) is a normalising constant.
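
A sketch of the log of the marginal likelihood (20) follows, assuming \(\ell \ge 2\) sorted changepoints; the use of np.histogram for the segment counts (whose last bin is right-inclusive, a negligible boundary effect for continuous arrival times) is an implementation assumption.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_y(y, tau, eta=1.0):
    """Log marginal likelihood (20) of the daily arrival times y, given
    circular changepoints tau; the wrap-around segment is handled last."""
    N = len(y)
    widths = np.append(np.diff(tau), 2 * np.pi - tau[-1] + tau[0])
    counts = np.append(np.histogram(y, bins=tau)[0], 0)
    counts[-1] = N - counts.sum()                    # wrap-around count N'_l
    return (gammaln(2 * np.pi * eta) - gammaln(2 * np.pi * eta + N)
            + np.sum(gammaln(counts + eta * widths) - gammaln(eta * widths)
                     - counts * np.log(widths)))
```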

In contrast to the human events, automated periodic events are generated regularly by the underlying polling mechanism, which typically operates irrespective of the time of day. Recall from Sect. 3 that the binary indicator variable \(z_i\) is defined to be equal to 1 if the ith event was periodic, and 0 otherwise. The approach which will now be adopted is to model the conditional density for the unwrapped event time \(t_i\), depending on the value of \(z_i\).

For simplicity of presentation, it will be assumed that the length of the observation period, T, is both a whole number of days and an integer multiple of p. Under this assumption

$$\begin{aligned} f(t_i\vert z_i)&=\frac{2\pi }{T}f(x_i\vert z_i=1)^{z_i}f(y_i\vert z_i=0)^{1-z_i}\nonumber \\&=\frac{2\pi }{T}\phi _{\mathrm {WN}}^{[0,2\pi )}(x_i;\mu ,\sigma ^2)^{z_i}s(y_i;\ell ,\varvec{\tau },\varvec{h})^{1-z_i}, \end{aligned}$$
(21)

implying the marginal mixture density

$$\begin{aligned} f(t_i)=\frac{2\pi }{T}\left\{ \theta \phi _{\mathrm {WN}}^{[0,2\pi )}(x_i;\mu ,\sigma ^2) +(1-\theta )s(y_i;\ell ,\varvec{\tau },\varvec{h}) \right\} . \end{aligned}$$
(22)

Figure 3 provides a graphical summary of the full model (22), and Fig. 4 shows an illustrative example of the mixture density. Note that relaxing the assumptions of divisibility of T by p or \(p^\prime \) simply requires straightforward calculation of corresponding normalising constants in (21), and this adjustment will be negligible when \(\lfloor T\delta /p\rfloor \) is large.
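
Putting the components together, the mixture density (22) might be evaluated as in the following sketch, which reuses the step_density sketch given after (18) and truncates the wrapped normal sum to \(\vert k\vert \le K\); the arrival times, T and the periods are assumed to be expressed in the same time units.

```python
import numpy as np
from scipy.stats import norm

def mixture_density(t, T, p, pprime, mu, sigma2, theta, tau, h, K=5):
    """Marginal mixture density (22) at raw arrival times t; builds on the
    step_density sketch given after (18)."""
    t = np.atleast_1d(t)
    x = 2 * np.pi * np.mod(t, p) / p                 # transformation (4)
    y = 2 * np.pi * np.mod(t, pprime) / pprime       # transformation (5)
    ks = np.arange(-K, K + 1)
    wn = norm.pdf(x[:, None], mu - 2 * np.pi * ks,
                  np.sqrt(sigma2)).sum(axis=1)       # wrapped normal density (6)
    return (2 * np.pi / T) * (theta * wn + (1 - theta) * step_density(y, tau, h))
```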

Fig. 3 Graphical representation of the extended Bayesian mixture model for separation of human and automated activity

Fig. 4 Example of the component densities in (22), with \(T=7\times 86{,}400\) (7 days), \(p=21{,}600\) (6 h), \(\mu =\pi \), \(\sigma ^2=1\), \(\theta =0.5\), \(\ell =12\), equally spaced segment locations, and step function heights \(\varvec{h}\) chosen to resemble a human daily distribution of arrival times. Upper panel: density of automated events (4 peaks per day are recorded since \(p=6\) h). Middle: density of human events (daily distribution repeated each day). Lower: the resulting mixture density (22)

Since most of the prior distributions have been chosen to be conjugate, it is possible to explicitly integrate out the segment heights \(\varvec{h}\), see (20), and the mixing proportion \(\theta \), leading to a collapsed Gibbs sampler (Liu 1994) for inference. This is advantageous, since it reduces the simulation effort to sampling the latent variables \(\varvec{z}\) and \(\varvec{\kappa }\), the parameters \(\mu \) and \(\sigma ^2\) for the wrapped normal component (6), and the number of circular changepoints \(\ell \) and their locations \(\varvec{\tau }\) in the human event density (18). The algorithm is described in detail in Appendix A.

The model could be further extended: an even more general framework for density estimation in a Bayesian setting is the Dirichlet process mixture (Escobar and West 1995). Inference in this case is cumbersome, but algorithms exist for relatively fast implementation (Neal 2000).

6 Applications

The algorithms described in the previous sections have been applied to computer network flow data collected at Imperial College London, for a single client IP address X, setting \(p^\prime ={86{,}400}\). In order to show the efficacy of the methods for filtering polling traffic, examples are presented using simulated data, a synthetically fused mixture and raw network flow data.

Fig. 5 Estimated daily density of non-polling events for the three simulated examples described in Sect. 6.1

6.1 Simulated data

The performance of the Gibbs sampler in recovering the correct densities for the model in Sect. 5 is first assessed on simulated data. Non-periodic events were simulated from a range of densities of increasing complexity, inspired by the test signals in Donoho and Johnstone (1994), rescaled and shifted to represent probability distributions on \([0,2\pi )\). Three distributions are used: (a) a step function density with 10 segments, where the changepoints and segment probabilities were sampled from \(\mathrm {Uniform}[0,2\pi )\) and \(\mathrm {Dirichlet}(1, \ldots ,1)\) distributions, respectively, (b) a heavisine function on \([0,2\pi )\), \(f(y)\propto 6+4\sin (2y) - \text {sgn}(y/2\pi -0.3) - \text {sgn}(0.72-y/2\pi )\), and (c) a function \(f(y)\propto \sum _{j=1}^{11} u_j (1+\vert {(y/2\pi -v_j)/w_j}\vert )^{-4}\) with 11 bumps, with the same choices for the parameters \(u_j\), \(v_j\) and \(w_j\) as in Donoho and Johnstone (1994), scaled to \([0,2\pi )\). A total of 3000 events is simulated from the chosen distributions and then assigned to a random day of the week, implying \(p^\prime ={86{,}400}\). Those events are mixed with 2000 periodic events generated from a wrapped normal distribution with mean \(\mu =5\) and variance \(\sigma ^2=1\) on \([0,2\pi )\), rescaled and assigned at random to windows of \(p=10\) s over one week. Note that the variance of the periodic signal is chosen to be relatively large to make the inferential procedure more challenging. In practical applications, the value of \(\sigma ^2\) is expected to be much smaller.

The results of the Gibbs sampling procedure for estimating the density of non-polling events, using the model in Sect. 5, are reported in Fig. 5. The algorithm is able to recover the density with good confidence, even in the case of departures from the step function assumption. Note that the estimated density cannot be expected to fit the density used to simulate the data perfectly, since the simulation is repeated only once, for a sample of size 3000, and the variability of the wrapped normal component was chosen to be large.

The estimates for the remaining parameters in the simulation using the step function density were \(({{\hat{\mu }}},{\hat{\sigma }}^2,{{\hat{\theta }}})=(5.0162, 0.9890, 0.4022)\). The performance of the classification algorithm can be assessed using the area under the receiver operating characteristic (ROC) curve, commonly denoted AUC. For the step function density, the resulting AUC score is 0.8161. For the heavisine function, the parameter estimates are (5.0268, 0.9868, 0.3882), and \(\mathrm {AUC} = 0.8007\). Finally, for the function with bumps, the estimates are (5.0162, 0.9890, 0.4022), and \(\mathrm {AUC} = 0.9337\). The parameter estimates closely match the values used in the simulation, and the AUC values are acceptable considering the complexity of the simulation and the fact that \(\sigma ^2=1\), much larger than the values expected in applications.

The computational efficiency of the collapsed Gibbs sampler for the model in Sect. 5 has been evaluated via simulation. Table 1 reports the elapsed time for 1000 sweeps of the sampler for different values of N, the number of observed arrival times. The experiments were performed running python code on a MacBook Pro 2017 with a 2.3 GHz Intel Core i5 dual-core processor, and the events were generated using the same simulation described in this section, with a step function density with 10 changepoints for the non-polling events.

Table 1 Elapsed time (s) for 1000 sweeps of the collapsed Gibbs sampler for the model in Sect. 5, as a function of the number of observations N

6.2 Synthetically labelled data: a mixture of automated and human connections

A fusion of two different network edges is considered: first, the activity between the client X and the Dropbox server 108.160.162.98, found to be strongly periodic at period \(p\approx 55.66\,{\text {s}}\), with associated p-value \(<0.0001\); and second, the activity between the client X and the Midasplayer server addresses 217.212.243.163 and 217.212.243.186, which exhibits activity exclusively during the day, relating to a human user playing the popular online game Candy Crush. Seven days of data starting from the first observation time on each edge were used in the present analysis, resulting in 32,865 Dropbox events and 4779 Candy Crush connections. The histograms of daily activity for the two edges are presented in Fig. 6. Notice that Dropbox is slightly more active at night than during the day, which makes the analysis more difficult. This is not uncommon behaviour for automated edges, which tend to 'stand down' during the day when a human sits at the machine. On the other hand, Candy Crush events only happen during working hours.

Fig. 6 Analysis on the synthetic Dropbox–Candy Crush data set. Top panel: histogram of the daily arrival times \(y_i\) of the Candy Crush (left) and Dropbox (middle) events (bin size: 5 min), and polar histogram of the daily arrival times for the mixed data (right). Bottom panel: polar histogram of the wrapped arrival times \(x_i\) with period \(p=55.66\) for the filtered periodic events and estimated wrapped normal density (left), estimated daily density of non-periodic events (middle) and histogram of the daily arrival times for the filtered non-periodic events (right)

The uniform-wrapped normal mixture model (cf. Sect. 4) fitted to the fused data using the EM algorithm quickly converges to the parameter estimates \(({{\hat{\mu }}},{{\hat{\sigma }}}^2,{{\hat{\theta }}})=(4.3376,0.4059,0.8585)\). The same results are obtained using different initialisation points and comparing different convergence criteria. Given the output of the EM algorithm, it is possible to filter the connections, keeping those such that \(\zeta _{i(0,0)}>\sum _{k=-\infty }^\infty \zeta _{i(1,k)}\) at the final iteration, where the infinite sum is in practice truncated to a suitable level. These filtered events are those which would be assigned to the uniform (non-periodic) component of the mixture in (7). In total, 2818 wrapped times were classified as non-periodic, and 2386 of these are connections to Candy Crush servers, resulting in a false positive rate \(\mathrm {FPR}=0.013\) and false negative rate \(\mathrm {FNR}=0.501\). Note that it is not surprising that approximately \(50\%\) of the Candy Crush events are missed, because these fall into the high density area of the wrapped normal by chance, being approximately uniform on the p-clock.
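
As a concrete illustration of this filtering rule, the sketch below computes the uniform responsibilities \(\zeta _{i(0,0)}\) from fitted parameter values and retains the events assigned to the non-periodic component; the data array, the parameter values (copied from the estimates above) and the truncation level are placeholders.

```python
import numpy as np
from scipy.stats import norm

# Placeholder wrapped times and fitted parameters standing in for the EM output
x = np.random.default_rng(0).uniform(0, 2 * np.pi, size=1000)
mu, sigma2, theta = 4.3376, 0.4059, 0.8585

# Uniform responsibility zeta_{i(0,0)} from (14), truncating the sum to |k| <= 5
ks = np.arange(-5, 6)
wn = norm.pdf(x[:, None], mu - 2 * np.pi * ks, np.sqrt(sigma2)).sum(axis=1)
zeta0 = ((1 - theta) / (2 * np.pi)) / (theta * wn + (1 - theta) / (2 * np.pi))

# Keep the events assigned to the uniform (non-periodic) component
non_periodic = x[zeta0 > 0.5]
```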

The results from the EM algorithm were then compared to the inferences obtained from the posterior distribution of the parameters \(\varvec{\psi }\) using the Bayesian algorithm of Sect. 4.2. The prior parameters were set to the uninformative values \(\mu _0=\pi , \lambda _0=1, \alpha _0=\beta _0=\gamma _0=\delta _0=1\), although given the large quantity of data available, the choice of the prior is in practice not influential on the results of the procedure. The resulting mean of the posterior distribution for \(\varvec{\psi }\) is \(\hat{\varvec{\psi }}=({{\hat{\mu }}},{{\hat{\sigma }}}^2,{{\hat{\theta }}})=(4.3375,0.4064,0.8583)\), almost identical to the result obtained using the EM algorithm. This is expected, since the two methods represent two different inferential approaches for the same model. Very similar results are also obtained when filtering the data. For event i, let \({\hat{z}}_i\) be the Monte Carlo estimate of \(z_i\); then, classifying events as non-periodic if \({\hat{z}}_i<0.5\) yields 2810 events, and 2377 of those are Candy Crush connections, corresponding to \(\mathrm {FPR}=0.013\) and \(\mathrm {FNR}=0.502\). In practice, for the uniform-wrapped normal model, it is recommended to use the EM algorithm, which converges faster than the Bayesian Markov chain Monte Carlo (MCMC) procedure, providing, as expected, equivalent results.

Finally, it is of interest to see whether the classification performance can be improved using the extended model presented in Sect. 5. The algorithm was initialised from the output of the EM algorithm, and the additional parameters were set to the uninformative values \(\nu =0.1\) and \(\eta =1\), although again the algorithm is robust to different starting points. The resulting posterior mean estimates of the wrapped normal distribution parameters are \(({{\hat{\mu }}}, {{\hat{\sigma }}}^2)=(4.362,0.376)\) and \({{\hat{\theta }}}=0.8506\) for the mixing proportion, which are slightly different from the previous analysis; in particular, the variance is lower. The estimated daily distribution of the non-periodic arrival times is plotted in Fig. 6. Note that its mean almost perfectly reproduces the histogram of the daily arrival times of the Candy Crush events. The estimated density has been obtained by sampling from the posterior distribution \(\varvec{h}\vert \varvec{\tau },\ell ,\varvec{y},\varvec{z}\) at each iteration of the Gibbs sampler, which has known form under the conjugate prior (19), and then averaging the density across the iterations. In this case, 3947 filtered events are labelled as non-periodic (\({\hat{z}}_i<0.5\)), with 2948 true positives, corresponding to \(\mathrm {FPR}=0.030\) and \(\mathrm {FNR}=0.383\). The resulting histogram of the filtered data is plotted in Fig. 6. The posterior distribution for the number of changepoints in the human density is approximately normal around the value \(\ell =28\), which roughly corresponds to one changepoint per hour of the day.

The algorithms proposed in the article can be compared more efficiently for classification purposes using a ROC curve over different values of the threshold for \({\hat{z}}_i\). The plot for this example is reported in Fig. 7, and it clearly shows that the proposed methodologies classify a large proportion of the events correctly, with low false positive rates at the threshold 0.5. Furthermore, including the daily arrival times \(\varvec{y}\) in the model is clearly beneficial. For practical applications, it is recommended to choose a threshold that guarantees low false positive rates for detection of human events: in this example, 0.5 seems an appropriate choice.

Fig. 7 ROC curves and AUC values evaluating the performance of two methods for classification of human events: Bayesian uniform-wrapped normal mixture (Sect. 4.2) and joint Bayesian model (Sect. 5). The grey squares correspond to the threshold 0.5

6.3 Real data: Imperial College NetFlow

In this example, the activity between a client Y and the server IP 13.107.42.11, used by the software Outlook, is analysed. The arrival times refer to a time period between August 2017 and November 2017, and 7 days of activity after the first observation were considered. The daily distribution of the activity on the edge is reported in Fig. 8. A total of 7583 connections was recorded. It can be observed from the histogram that the activity on the edge is almost entirely automated, but the number of connections slightly increases during working hours compared to the night (in contrast with the dip observed in other automated services such as Dropbox). This suggests a mixture of human activity and polling behaviour on this edge, which is further supported by the nature of the software. The arrival times on the edge have been found to be strongly periodic at period \(p\approx 8\,{\text {s}}\), with an associated g-test p-value \(<10^{-7}\).

Fig. 8 Analysis on the edge \(Y\rightarrow \) 13.107.42.11 (Outlook). Top: polar histograms of the daily arrival times \(y_i\) of the events (left), of the wrapped arrival times \(x_i\) for periodicity \(p\approx 8\,{\text {s}}\) with fitted wrapped normal density (middle) and of the filtered non-periodic events (right). Bottom: resulting estimated daily density of non-periodic events (left) from applying the algorithm once, and the estimated daily density of human events (right) obtained from re-applying the algorithm with the same periodicity on the filtered events from plot (c)

The uniform-wrapped normal mixture model (cf. Sect. 4), with period \(8\,{\text {s}}\), fitted using the EM algorithm converges to the parameter estimates \(({{\hat{\mu }}},{{\hat{\sigma }}}^2,{{\hat{\theta }}})=(1.872,0.670,0.714)\). In this case study, 1246 of the 7583 events were assigned to the uniform (non-periodic) category using the criterion \(\zeta _{i(0,0)}>0.5\). Identical parameter estimates are obtained using the Bayesian mixture model, and the classification of the connections as periodic or non-periodic is again almost the same, with 1232 connections classified as non-periodic by the model. Furthermore, most of the activity in the filtered events is concentrated in working hours, even though this is not explicitly encouraged by this model. This is promising, since the algorithm has been able to recover human-like activity from an edge that appears almost entirely automated.

Next, the Gibbs sampler was used to infer the parameters of the joint Bayesian model (cf. Sect. 5). The same prior parameter values as the previous section were used. The convergence of the sampler to the correct target is again almost immediate. The number of non-periodic connections was estimated as 1430. The resulting posterior mean for the parameters of the wrapped normal distribution for the polling component is \(({{\hat{\mu }}},{{\hat{\sigma }}}^2)=(1.885,0.6249)\) and \({{\hat{\theta }}}=0.6935\) for the mixing proportion. The daily distribution of the non-periodic connections is reported in Fig. 8 and displays a strong diurnal pattern, suggesting human behaviour has been classified well.

However, it is also evident that in this example, the algorithm classifies as human a proportion of connections occurring during the night. Potential issues that can arise are multiple periodicities or phase shifts within the same data stream. A possible solution would be to iteratively repeat the analysis on the filtered non-periodic events until no significant short-term periodicities are obtained using the g-test. In this example, repeating the analysis with period \(8\,{\text {s}}\) allows the residual automated activity to be filtered out, thereby obtaining an estimated daily distribution which is entirely consistent with human-like behaviour, shown in Fig. 8e. After this last stage of the analysis, only 181 events are retained as human-generated, corresponding to \(\approx 2.5\%\) of the initial 7583 events. This proportion is consistent with results obtained in previous studies on computer network data (Price-Williams et al. 2017).

The performance of the algorithm for filtering polling activity can also be assessed by comparing the model fit to both the filtered and unfiltered event streams when applying the nonparametric Wold process model of Price-Williams and Heard (2020), which has been shown to be suitable for human-like events in computer network traffic. There, a counting process of human-generated events is modelled with a conditional intensity represented as a step function with an inferred number of changepoints. If \(y_1,y_2,\ldots \) are the event times of such a counting process Y(t), the conditional intensity has the form

$$\begin{aligned} \lambda _Y(t)=\lambda + \sum _{j=1}^\ell \lambda _{j} \mathbb {I}_{[\tau _{j-1},\tau _{j})}(t-y_{Y(t)}), \end{aligned}$$
(23)

where \(0\equiv \tau _0<\tau _1<\ldots <\tau _\ell \) are a finite sequence of changepoints and \(\lambda _1>\ldots >\lambda _\ell \) are a decreasing sequence of corresponding step heights, representing the fall in intensity experienced as the waiting time increases between the current time t and the most recent event \(y_{Y(t)}\). In contrast, periodic network events are not self-exciting; their conditional intensity would decrease immediately after an event, and only increase when the next periodic signal is anticipated.

Price-Williams and Heard (2020) used predictive p-values to assess model fit of the intensity model (23); defining \(y_0\equiv 0\) and the compensator function \(\varLambda (t) = \int _{s=0}^t \lambda _Y(s)ds\), a lower-tail p-value of the ith waiting time is

$$\begin{aligned} p_i = 1- \exp \left[ - \{\varLambda (y_{i}) - \varLambda (y_{i-1})\} \right] . \end{aligned}$$
(24)
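
Under the step function form (23), each compensator increment integrates the intensity over the waiting time since the previous event, so the p-values (24) are available in closed form. The sketch below assumes fitted values of the baseline rate, step heights and changepoints; the function name is illustrative.

```python
import numpy as np

def wold_pvalues(y, lam, lam_steps, tau):
    """Lower-tail predictive p-values (24) under the Wold intensity (23):
    lam is the baseline rate, lam_steps the step heights (lambda_1 > ... >
    lambda_l) and tau the changepoints tau_1 < ... < tau_l (tau_0 = 0).
    Between consecutive events the elapsed time t - y_{Y(t)} grows from 0 to
    the waiting time w_i, so each increment integrates the steps over [0, w_i)."""
    edges = np.concatenate(([0.0], np.asarray(tau)))
    w = np.diff(np.concatenate(([0.0], np.asarray(y))))  # waiting times, y_0 = 0
    pv = np.empty_like(w, dtype=float)
    for i, wi in enumerate(w):
        time_in_step = np.clip(np.minimum(wi, edges[1:]) - edges[:-1], 0.0, None)
        pv[i] = 1.0 - np.exp(-(lam * wi + np.dot(lam_steps, time_in_step)))
    return pv
```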

Figure 9 reports the QQ plot of the distribution of predictive p-values (24) obtained using the first 4 days of observations as training data and the remaining days as test data, for both the unfiltered and filtered non-periodic events from Fig. 8c. The distribution of the p-values clearly improves when the filtered non-periodic events are used. The Kolmogorov–Smirnov (KS) score, based on the maximum absolute difference between the empirical and theoretical CDFs, significantly decreases for the filtered events, reaching a value which is consistent with the results obtained by Price-Williams and Heard (2020) on Imperial College NetFlow data.

Fig. 9 Uniform QQ plots of predictive p-values on the unfiltered and filtered non-periodic events, obtained using the nonparametric Wold process model of Price-Williams and Heard (2020), and corresponding Kolmogorov–Smirnov scores

This example strikingly illustrates a characteristic of real computer network traffic data: the activity on automated edges only slightly increases during the day due to the presence of a human at the machine. Despite these difficulties, the algorithm was able to derive a reasonable distribution for the human events.

7 Conclusion

In this article, a statistical framework for classification of arrival times in event time data has been proposed. The methodology was motivated by applications to computer network modelling for cyber-security. In particular, the filtering methodology developed in Heard et al. (2014) has been extended to network edges that present a mixture of human and automated polling activity, in order to prevent the loss of information caused by removing a seemingly automated edge entirely from the analysis. This has initially been achieved using a simple mixture model based on a uniform distribution and a wrapped normal distribution on the unit circle. Frequentist and Bayesian algorithms for the estimation of the parameters have been presented. The model has then been extended to include available information on the daily arrival times of the events, demonstrating significant performance improvements on synthetic data sets with known labels. Bayesian inference is straightforward since simple conjugate distributions are used, and therefore minimal adaptation is required from the user. Synthetically fused and real data examples show that the model is able to successfully recover a significant amount of the non-periodic activity and its distribution.

After fitting the model, the estimated values of the parameters can be used for instantaneous estimation of \(z_{i^\prime }\) for classification of future arrival times \(t_{i^\prime }\). Depending on the application, it might be necessary to update the parameter estimates from time to time as more data become available. The Bayesian framework naturally allows for prior-posterior updates, where the estimated posterior parameters can be used as prior hyperparameters when new data are available (Bernardo and Smith 1994). In that case, it would be necessary to perform the inferential procedure again, including the newly observed arrival times, and possibly removing a subset of the old observations to both fix the overall computational cost of the inferential procedure, which otherwise grows in N as shown in Table 1, and allow for any adaptation in the model.

The methodology proposed in this article fits naturally within the literature on Bayesian model-based clustering (see Lau and Green 2007, for example), where MCMC methods are commonly used for inference on the latent allocations and model parameters (West et al. 1994; Richardson and Green 1997, for example). The proposed model complements and extends this literature, providing a Bayesian framework for classification of event time data when a mixture of periodic and non-periodic events is observed.

Further possible extensions of the model could allow explicit accounting for phase shifts, using mixtures of wrapped normals with shared variances for the automated component, or allowing for changepoints in the mean \(\mu \) of the wrapped normal distribution, accounting for the arrival order of each \(x_i\). Furthermore, the case of multiple periodicities could be considered, using tests for multiple polling frequencies, for example Siegel (1980), yielding periodicities \(p_1, \ldots ,p_K\), and obtaining a mixture with multiple transformations \(x_{ik} = (t_i\bmod p_k) \times 2\pi / p_k,\ k=1, \ldots ,K\). The model could also be adapted to allow for fixed duration polling, and alternative distributions could also be considered for the automated component, for example the wrapped Laplace distribution.

Within the application of computer network security, improvements might be achieved by including host specific information; unified data sets of this type have recently become available (Turcotte et al. 2018). Finally, the algorithm can be applied independently on multiple computer network edges, or, in principle, the same human density could be fitted for all edges emanating from the same source node, but allowing for different periodicities for traffic on each edge.

8 Supplementary material

The python code and datasets used in this article are publicly available in the repository https://github.com/fraspass/human_activity.