1. Introduction

Data analysts often have to deal with data whose variability differs from what they expect on the basis of the hypothesized model. The phenomenon is known as overdispersion if the observed variability exceeds the expected variability, or underdispersion if it is lower than expected.

Such differences between observed and nominal variances can be interpreted as brought about by failures of some of the basic assumptions of the model. These can be classified by the mechanism leading to them. As summarized by Xekalaki ([2006]), in traditional experimental contexts, they may be caused by deviations from the hypothesized structure of the population, due to lack of independence between individual item responses, contagion, clustering, and heterogeneity. In observational study contexts, on the other hand, they are the result of the method of ascertainment, which can lead to partial distortion of the observations. In both contexts, the observed value x no longer represents an observation on the original variable X, but constitutes an observation on a random variable Y whose distribution (the observed distribution) is a distorted version of the distribution of X (original distribution).

Such practical situations were first noticed over a century ago (e.g. Lexis [1879]; Student [1919]). The Lexis ratio appears to be the first statistic suggested for testing for the presence of over- or under-dispersion relative to a binomial hypothesized model in populations structured in clusters. Also, for count data, Fisher ([1950]) considered using the sample index of dispersion for testing the appropriateness of a Poisson distribution for an observed variable Y.

The paper is structured as follows. Section 2 introduces the reader to the various approaches to modelling overdispersion in the case of traditional experimental contexts. Section 3 highlights approaches in the case of observational study contexts. Section 4 focuses on the case of heterogeneous populations followed by Sections 5 and 6, which look into a particular type of distribution, the generalized Waring distribution, and its relevance in the context of applications under the various scenaria leading to over-dispersion mentioned above. Through the prism of these scenaria, a bivariate version of it is also presented, and its use in applied contexts is discussed in Section 7. A multivariate version of it is also given, and its application potential is outlined in Section 8. Finally, Sections 9 and 10 present a model for temporally evolving data, the multivariate generalized Waring process, and an application illustrating its practical potential.

As the field of accident studies has received much attention, and various theories have been developed for the interpretation of factors underlying an accident situation, most of the models will be presented in accident or actuarial data analysis contexts. Of course, the results can be adapted to a great variety of situations with appropriate parameter interpretations, so that they can be applied in several other fields ranging from economics, inventory control and insurance through to demometry, biometry, psychometry and web access modeling, as is the case with the application discussed in Section 10.

2. Modelling over- or under-dispersion in traditional experimental contexts

One important implication of using single-parameter distributions such as the Poisson distribution to analyse data, often ignored by data analysts, is that the variance is determined by the mean, a relation that breaks down in the presence of overdispersion. If this is ignored in practice, statistical inference may suffer from low efficiency, although for modest amounts of overdispersion this may not be the case (Cox [1983]). So, insight into the mechanisms that induce over- (or under-) dispersion is required when dealing with such data. Such insight can be gained by looking at the above-mentioned potential triggering sources as classified by Xekalaki ([2006]).

2.1 Lack of independence between individual responses

In accident-study contexts, one is often interested in the total number of reported accidents $Y = \sum_{i=1}^{n} Y_i$ among a total number n of accidents that actually occurred. When accidents are reported with equal probabilities $p = P(Y_i = 1) = 1 - P(Y_i = 0)$, but not independently ($\mathrm{Cor}(Y_i, Y_j) = \rho \neq 0$), the mean of Y will still be E(Y) = np, but its variance will be $V(Y) = V\left(\sum_{i=1}^{n} Y_i\right) = np(1-p) + 2\binom{n}{2}\rho p(1-p) = np(1-p)\{1 + \rho(n-1)\}$, which exceeds that anticipated under a hypothesized independent trial binomial model if ρ > 0 (over-dispersion) and is exceeded by it if ρ < 0 (under-dispersion).
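The variance formula above can be checked numerically. The following is a minimal sketch, not part of the original text: the function names are illustrative, and it assumes an equicorrelated set of Bernoulli indicators, for which a valid correlation must satisfy ρ ≥ −1/(n − 1).

```python
import math

def variance_correlated_sum(n: int, p: float, rho: float) -> float:
    """V(Y) for Y = Y_1 + ... + Y_n with equicorrelated Bernoulli(p) terms.

    Built term by term: n variances plus 2 * C(n, 2) pairwise covariances.
    """
    var_single = p * (1.0 - p)      # V(Y_i)
    cov_pair = rho * var_single     # Cov(Y_i, Y_j) for i != j
    return n * var_single + 2.0 * math.comb(n, 2) * cov_pair

def dispersion_ratio(n: int, rho: float) -> float:
    """Closed-form ratio of V(Y) to the binomial variance: 1 + rho * (n - 1)."""
    return 1.0 + rho * (n - 1)

if __name__ == "__main__":
    n, p = 10, 0.3
    for rho in (-0.05, 0.0, 0.1):
        v = variance_correlated_sum(n, p, rho)
        # The last two printed columns should agree.
        print(rho, v, v / (n * p * (1 - p)), dispersion_ratio(n, rho))
```

A ratio above 1 corresponds to over-dispersion relative to the binomial model and a ratio below 1 to under-dispersion.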

2.2 Contagion

Another common reason for a variance differing from what is anticipated arises when the assumption that the probability of the occurrence of an event in a very short interval is constant fails. This framework is the classical contagion model (Greenwood and Yule [1920]; Xekalaki [1983a]).

In data modelling problems faced by actuaries, for example, this model postulates that initially all individuals have the same probability of incurring an accident, but later this probability changes with each accident sustained. It is assumed, specifically, that initially none of the individuals has had an accident (e.g. new drivers or persons who are just beginning a new type of work), but later the probability with which a person with Y = y accidents by time t will have another accident in the time period from t to t + dt is of the form (k + my)dt. This leads to the negative binomial as the distribution of Y, with p.f. $P(Y = y) = \binom{k/m + y - 1}{y} e^{-kt}(1 - e^{-mt})^{y}$, y = 0, 1, 2, …, with $\mu = E(Y) = k(e^{mt} - 1)/m$ and $V(Y) = k e^{mt}(e^{mt} - 1)/m = \mu e^{mt}$.

2.3 Clustering

A frequently overlooked clustered structure of the population may also induce over- or under-dispersion.

In an accident context again, an accident is regarded as a cluster of injuries:

The number Y of injuries incurred by persons involved in N accidents can naturally be thought of as expressed by the sum $Y = Y_1 + Y_2 + \dots + Y_N$ of the numbers $Y_i$ of injuries resulting from the i-th accident, assumed to be i.i.d., independently of the total number of accidents N, with mean μ and variance σ². In this case, $E(Y) = E\left(\sum_{i=1}^{N} Y_i\right) = \mu E(N)$ and $V(Y) = V\left(\sum_{i=1}^{N} Y_i\right) = \sigma^2 E(N) + \mu^2 V(N)$.

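So, when N is a Poisson variable with mean E(N) = θ = V(N), the last relationship gives V(Y) = θ(σ² + μ²), so that the variance of Y exceeds or falls short of the Poisson variance θ according as σ² + μ² is greater or less than 1. (Measured against the mean E(Y) = μθ of Y itself, the variance-to-mean ratio is (σ² + μ²)/μ.)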

The first such model was introduced by Cresswell and Froggatt ([1963]) in a different accident context, whereby each person is liable to spells of weak performance during which all of the person’s accidents occur. So, if the number N of spells in a unit time period is Poisson distributed with mean θ, and within a spell a person can have 0 accidents with probability 1 + m log p, 0 < m < −1/log p, 0 < p < 1, and n accidents (n ≥ 1) with probability $m(1 - p)^{n}/n$, the observed distribution of accidents is the negative binomial distribution with probability function $P(Y = y) = \binom{\theta m + y - 1}{y} p^{\theta m}(1 - p)^{y}$. This model, known in the literature as the spells model, can also lead to other forms of overdispersed distributions (e.g. Xekalaki [1983a], [1984a]).

2.4 Heterogeneity

Assuming a homogeneous population when in fact the population is heterogeneous, i.e., when its individuals have constant but unequal probabilities of experiencing an event, can also lead to overdispersion. In this case, each member of the population has its own value of the parameter θ and probability density function f(⋅ ; θ).

So, with θ regarded as the inhomogeneity parameter and varying from individual to individual according to any continuous, discrete, or finite step distribution G(⋅) of mean μ and variance σ², one is led to an observed distribution for Y with probability density function $f_Y(y) = E_G(f(y; \theta)) = \int_{\Theta} f(y; \theta)\,dG(\theta)$, where Θ is the parameter space. Models of this type are known as mixtures. (For details on their application in the statistical literature see e.g. Karlis and Xekalaki [2003]; McLachlan and Peel [2001]; Titterington [1990]). Under such models, the variance of Y consists of two additive components, one representing the variance part due to the variability of θ and one due to the inherent variability of Y if θ did not vary, i.e., V(Y) = V(E(Y|θ)) + E(V(Y|θ)). This offers an explanation as to why mixture models are often referred to as overdispersion models.

It should be noted that a similar idea forms the basis for analysis-of-variance (ANOVA) models, where the total variability can be split into additive components, the ‘between groups’ and the ‘within groups’ components. In the case of the Poisson (θ) distribution, we have in particular that V(Y) = E(θ) + V(θ). Based on the fact that in this case, the factorial moments of Y coincide with the moments of θ about the origin, Carriere ([1993]) proposed a test of the hypothesis that a Poisson mixture fits a data set.

Mixed Poisson distributions were first introduced by Greenwood and Woods ([1919]) in the context of accident studies. Assuming that an individual’s accident experience Y|θ is Poisson distributed with parameter θ varying from individual to individual according to a gamma distribution with mean μ and index parameter μ/γ, they obtained a negative binomial distribution for Y with probability function $P(Y = y) = \binom{\mu/\gamma + y - 1}{y} \left(\frac{\gamma}{1 + \gamma}\right)^{y} (1 + \gamma)^{-\mu/\gamma}$ and with mean and variance given respectively by E(Y) = μ and V(Y) = μ(1 + γ), where γ represents the over-dispersion parameter.
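As an illustrative check of the mean and variance just stated, the sketch below evaluates the negative binomial probability function in the (μ, γ) parameterization and sums its moments numerically. The function names and the truncation point `y_max` are assumptions made for this example only.

```python
import math

def nb_pmf(y: int, mu: float, gamma: float) -> float:
    """Negative binomial p.f. obtained from a gamma-mixed Poisson distribution.

    The index is mu / gamma and the 'failure' weight is gamma / (1 + gamma).
    Computed on the log scale for numerical stability.
    """
    r = mu / gamma
    log_p = (math.lgamma(r + y) - math.lgamma(r) - math.lgamma(y + 1)
             + y * math.log(gamma / (1.0 + gamma))
             - r * math.log(1.0 + gamma))
    return math.exp(log_p)

def nb_moments(mu: float, gamma: float, y_max: int = 2000):
    """Total mass, mean and variance from a truncated sum over 0..y_max."""
    probs = [nb_pmf(y, mu, gamma) for y in range(y_max + 1)]
    total = sum(probs)
    mean = sum(y * q for y, q in enumerate(probs))
    var = sum((y - mean) ** 2 * q for y, q in enumerate(probs))
    return total, mean, var

if __name__ == "__main__":
    total, mean, var = nb_moments(mu=2.0, gamma=0.5)
    # Expect mass close to 1, mean close to mu, variance close to mu * (1 + gamma).
    print(total, mean, var)
```

The variance-to-mean ratio returned by this computation is 1 + γ, which is why γ can be read as the over-dispersion parameter.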

The mixed Poisson process was popularised in the actuarial literature by Dubourdieu ([1938]), and the gamma mixed case was treated by Thyrion ([1969]).

Numerous other mixtures have since been proposed in the literature for interpreting overdispersion in data, such as binomial mixtures (e.g. Tripathi et al. [1994]), negative binomial mixtures (e.g., Xekalaki [1983a], [1983c], [1984a]; Irwin [1975]), normal mixtures (e.g. Andrews and Mallows [1974]) and exponential mixtures (e.g. Jewell [1982]). Discrete Poisson mixtures with finite step distributions for the Poisson parameter θ have also been proposed, the interest being in creating clusters of data by grouping the observations on Y according to some criterion (cluster analysis). The number of clusters can be decided on the basis of a testing procedure for the number of components in the finite mixture (Karlis and Xekalaki [1999]).

2.4.1 Heterogeneity in mixture models treating the parameter θ as the dependent variable in a regression model

Heterogeneity in models with explanatory variables can be modelled by assuming that Y has a parameter θ varying from individual to individual according to some regression model θ = η(x; β) + ε, where x is a vector of explanatory variables, β is a vector of regression coefficients, η is a function of a known form and ε has some known distribution. Such models are known in the literature as random effect models and have been extensively studied within the broad family of Generalized Linear Models. As a simple example in the case of a single covariate, say X, consider data $Y_i$, i = 1, 2, …, n coming from a Poisson population with mean θ determined by log θ = α + βx + ε for some constants α, β and with ε having a distribution with mean 0 and variance, say, ϕ. In this case, the marginal distribution of Y is no longer the Poisson distribution. It is a mixed Poisson distribution, with some mixing distribution g(⋅) clearly depending on the distribution of ε. In particular, $Y \mid t \sim \mathrm{Poisson}(t\,e^{\alpha + \beta x})$ with $t \sim g(t)$, where $t = e^{\varepsilon}$.

Negative Binomial and Poisson Inverse Gaussian regression models have also been proposed as overdispersed alternatives to the Poisson regression model (e.g. Lawless [1987]; Dean et al. [1989]; Xue and Deddens [1992]). In the case of a two-step finite distribution, the finite Poisson mixture regression model of Wang et al. ([1996]) results. The similarity of the mixture representation and the random effects one is discussed in Hinde and Demetrio ([1998]).

In meta-analysis contexts, overdispersion (or underdispersion) refers to variance inflation (or deflation) relative to that anticipated by the fixed effects model. Two possible causes of such phenomena are a population structure in clusters or mixing resulting in a compound distribution. Kulinskaya and Olkin ([2014]) proposed approaching the problem of specification of a random effects model in meta-analysis in terms of a multiplicative model for the distribution of the effect size parameters that allows inflation or deflation. The model considered was motivated by overdispersion induced by intra-class correlation in the model assumed for the distribution of the i-th effect size estimate. In particular, the variance of the estimator $\hat{\theta}_i$ of the effect size parameter $\theta_i$ in the i-th study is assumed to be of the form $\sigma^2_{\hat{\theta}_i} = \{1 + \alpha(n_i)\gamma\}\sigma_i^2$, where $\alpha(n_i)$ are some known functions of the sample sizes $n_i$, $\sigma_i^2$ is the within-study variance of the i-th study, i = 1, 2, …, k, and γ is interpreted as an intra-class correlation parameter.

2.4.2 Estimation and testing for overdispersion under mixture models

The structure of mixture models, including random effect models, entails different forms of variance-to-mean relationships. So, viewing the mean and variance of Y as represented by E(Y) = μ(β) and V(Y) = σ²(μ(β), λ) respectively, for some parameters β, λ, a number of estimation approaches have been proposed in the literature based on moment methods (e.g. Breslow [1990]; Lawless [1987]; Moore [1986]) and quasi or pseudo likelihood methods (e.g. Davidian and Carroll [1988]; McCullagh and Nelder [1989]; Nelder and Pregibon [1987]). The above representation for the mean and variance of Y also allows estimation in the case of multiplicative overdispersion as in McCullagh and Nelder ([1989]).

Testing for the presence of overdispersion or underdispersion, on the other hand, can be done by means of asymptotic arguments. Let f(y; θ) denote the density function of a random variable Y in the initial model. Cox ([1983]) showed that, under regularity conditions, the density of y in the overdispersed model, $f_Y(y)$, admits a representation of the form $f_Y(y) = E_{\Theta}\{f(y; \theta)\} = f(y; \mu_\theta) + \frac{1}{2}\sigma_\theta^2 \frac{\partial^2 f(y; \mu_\theta)}{\partial \mu_\theta^2} + O(1/n)$, where $\mu_\theta = E(\theta)$, $\sigma_\theta^2 = V(\theta)$ and Θ is the parameter space. This in turn implies that $f_Y(y)$ can be put in the form $f(y; \mu_\theta)(1 + \varepsilon h(y, \phi_\theta))$, where $h(y, \phi_\theta) = \left\{\frac{\partial \log f(y; \mu_\theta)}{\partial \mu_\theta}\right\}^2 + \frac{\partial^2 \log f(y; \mu_\theta)}{\partial \mu_\theta^2}$.

This representation entails overdispersion if ε > 0, underdispersion if ε < 0 and, of course, none of these complications if ε = 0. Cox ([1983]) suggested a testing procedure for the hypothesis ε = 0, which can be regarded as a general version of standard dispersion tests.

2.5 Zero adjusted models

It is worth noting that another aspect of the population structure that is often responsible for over- or under-dispersion is the presence of an excess or a scarcity of zeros. Though the models discussed in Sections 2.3 and 2.4 may capture over- or under-dispersion rather well, they cannot capture an excess or scarcity of zeros. In the literature, this question has been addressed by two types of models, known as zero-inflated (or zero-deflated) models and hurdle models. A unified representation of these models is provided by $f(y; \omega) = \omega I_{\{0\}}(y) + (1 - \omega) f_Y(y)$, where Y is the count variable, $I_{\{0\}}(\cdot)$ is the indicator function of the set {0} and ω is a constant. Values of ω in (0, 1) yield a hurdle model when $f_Y(0) = 0$ and a zero-inflated model when $f_Y(0) \neq 0$, while negative values of ω yield a zero-deflated model.

Obviously, ω can be interpreted as the proportion of excess zeros in the case of the first two models, and the above representation explains why they can be regarded as having a dual nature. They are (finite) mixtures, which account for heterogeneity, while at the same time they capture a population structure in two clusters. However, in the case ω < 0 (zero-deflation), the model ceases to admit a mixture interpretation.
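To make the zero-adjusted representation concrete, the following sketch takes the count distribution to be Poisson(θ) (an assumption chosen for illustration) and computes the variance-to-mean ratio of the adjusted model by direct summation. The function names are hypothetical. For zero-deflation, ω must not fall below −f(0)/(1 − f(0)), otherwise the probability at zero becomes negative.

```python
import math

def poisson_pmf(y: int, theta: float) -> float:
    """Poisson(theta) probability function, computed on the log scale."""
    return math.exp(-theta + y * math.log(theta) - math.lgamma(y + 1))

def zero_adjusted_pmf(y: int, omega: float, theta: float) -> float:
    """omega * I{0}(y) + (1 - omega) * f(y) with f the Poisson(theta) p.f."""
    point_mass = omega if y == 0 else 0.0
    return point_mass + (1.0 - omega) * poisson_pmf(y, theta)

def dispersion_index(omega: float, theta: float, y_max: int = 200) -> float:
    """Variance-to-mean ratio of the zero-adjusted Poisson model."""
    probs = [zero_adjusted_pmf(y, omega, theta) for y in range(y_max + 1)]
    mean = sum(y * q for y, q in enumerate(probs))
    var = sum((y - mean) ** 2 * q for y, q in enumerate(probs))
    return var / mean

if __name__ == "__main__":
    theta = 3.0
    for omega in (-0.03, 0.0, 0.2):
        # Zero-deflated, plain Poisson, and zero-inflated cases.
        print(omega, dispersion_index(omega, theta))
```

In this Poisson case the ratio works out to 1 + ωθ, so zero-inflation (ω > 0) produces over-dispersion and zero-deflation (ω < 0) produces under-dispersion relative to the Poisson model.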

Zero-inflated and hurdle models have mostly been used for Poisson, generalized Poisson or negative binomial count distributions in various contexts (e.g. Ridout et al. [2001]; Gupta et al. [2004]; Famoye and Singh [2006]). Gupta et al. ([1996]) proposed a zero-adjusted generalized Poisson distribution and studied the effect of not using an adjusted model for zero-inflation or -deflation when the occurrence of zeroes differs from the anticipated one. Reviews of such models can be found in Ridout et al. ([1998]), Gschlößl and Czado ([2008]) and Ngatchou-Wandji and Paris ([2011]).

3. Over- or under-dispersion in observational study contexts: the effect of the method of ascertainment

Often, in connection with data collection based on observation or on recording values as produced by nature, the original distribution may not be reproduced due to various reasons. These may lead to partial destruction or partial enhancement (augmentation) of observations. The models that have been introduced to deal with such situations are respectively known as damage models, introduced by Rao ([1963]), and generating models, introduced by Panaretos ([1983]). The distortion mechanism is usually assumed to be manifested through the conditional distribution of the resulting random variable Y given the value of the original random variable X. Hence, the resulting (observed) distribution is a distorted version of the original distribution that can be represented as a mixture of the distortion mechanism. In particular, in the case of damage, $P(Y = r) = \sum_{n=r}^{\infty} P(Y = r \mid X = n)P(X = n)$, r = 0, 1, 2, …, while, in the case of enhancement, $P(Y = r) = \sum_{n=1}^{r} P(Y = r \mid X = n)P(X = n)$, r = 1, 2, ….
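The damage-model sum can be illustrated with a small numerical sketch. It assumes, purely for illustration, a Poisson(θ) original distribution and a binomial survival mechanism with probability p; under these assumptions the observed distribution is known to be Poisson(θp), which the code checks. The function names and the truncation point `n_max` are hypothetical.

```python
import math

def poisson_pmf(n: int, theta: float) -> float:
    """Poisson(theta) probability function, computed on the log scale."""
    return math.exp(-theta + n * math.log(theta) - math.lgamma(n + 1))

def binomial_pmf(r: int, n: int, p: float) -> float:
    """P(Y = r | X = n): each of the n original units survives with prob. p."""
    return math.comb(n, r) * p ** r * (1.0 - p) ** (n - r)

def damaged_pmf(r: int, theta: float, p: float, n_max: int = 150) -> float:
    """P(Y = r) as the sum over n >= r of P(Y = r | X = n) P(X = n)."""
    return sum(binomial_pmf(r, n, p) * poisson_pmf(n, theta)
               for n in range(r, n_max + 1))

if __name__ == "__main__":
    theta, p = 4.0, 0.6
    for r in range(5):
        # The two columns should agree: damaged Poisson vs Poisson(theta * p).
        print(r, damaged_pmf(r, theta, p), poisson_pmf(r, theta * p))
```

Other distortion mechanisms can be explored by swapping `binomial_pmf` for a different conditional distribution.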

Various forms of distributions have been considered for the distortion mechanism in the above two cases. In the case of damage, the most popular forms have been the binomial distribution (Rao [1963]) and mixtures on p of the binomial distribution (e.g. Panaretos [1982]; Xekalaki and Panaretos [1983]), whenever damage can be regarded as additive (Y = X − U, U independent of Y), or the uniform distribution in (0, x) (e.g. Dimaki and Xekalaki [1990], [1996]; Xekalaki [1984b]), whenever damage can be regarded as multiplicative (Y = [RX], R independent of X and uniformly distributed in (0, 1)). The latter case has also been considered in the context of continuous distributions by Krishnaji ([1970]). The generating model was introduced and studied by Panaretos ([1983]).

Both the generating model and the damage model offer a perceptive approach in actuarial contexts where one is interested in modelling the distributions of the numbers of accidents, of the damage claims, and of the claimed amounts. These models become relevant because people generally tend to under-report their accidents, so that the reported (observed) number Y is less than or equal to the actual number X (Y ≤ X), but tend to over-report damages incurred by them, so that the reported damage Y is greater than or equal to the true damage X (Y ≥ X).

Another type of distortion is induced by the adoption of a sampling scheme that assigns to the units in the original distribution unequal probabilities of inclusion in the sample. As a result, the value x of X is observed with a frequency that noticeably differs from that anticipated under the original density function $f_X(x; \theta)$. It represents an observation on a random variable Y whose probability distribution is the result of adjusting the probabilities of the anticipated distribution by weighting them with the probability with which the value x of X is included in the sample. So, if this probability is proportional to some weight function w(x; β), β ∈ R, the recorded value x is a value of Y having density function $f_Y(x; \theta, \beta) = w(x; \beta) f_X(x; \theta)/E(w(X; \beta))$.

Distributions of this type are known as weighted distributions (see, e.g. Cox [1962]; Fisher [1934]; Patil and Ord [1976]; Rao [1985]). For w(x; β) = x, these are known as size-biased distributions. In actuarial data modelling contexts again, the weight function can represent reporting bias. In the context of reporting accidents or placing damage claims, for example, it can have a value that is directly or inversely proportional to the size x of X, the actual number of incurred accidents or the actual size of the incurred damage. The functions w(x; β) = x and $w(x; \beta) = \beta^{x}$ (β > 1 or β < 1) are plausible choices. So, for example, in the case of a Poisson(θ) distributed X, these lead to distributions for Y that are of Poisson type. In particular, the weight function w(x; β) = x leads to a shifted Poisson distribution with probability function $P(Y = x) = e^{-\theta}\theta^{x-1}/(x-1)!$, x = 1, 2, …, while the choice $w(x; \beta) = \beta^{x}$ leads to a Poisson distribution $P(Y = x) = e^{-\theta\beta}(\theta\beta)^{x}/x!$, x = 0, 1, …. Under the first choice of w(x; β), Y is distributed as 1 plus a Poisson(θ) variable, so that its mean is 1 + θ and exceeds that of X, while its variance remains θ. Under the second choice, the variance of Y is θβ, which exceeds that of X for β > 1 (overdispersion) and falls short of it for β < 1 (underdispersion).
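The two weight functions discussed above can be compared numerically. The sketch below is an illustration under the assumption of a Poisson(θ) original distribution; the helper names and the truncation point `x_max` are not from the source. It builds the weighted probability function w(x) f(x)/E(w(X)) by direct normalization and returns the mean and variance of Y.

```python
import math

def poisson_pmf(x: int, theta: float) -> float:
    """Poisson(theta) probability function, computed on the log scale."""
    return math.exp(-theta + x * math.log(theta) - math.lgamma(x + 1))

def weighted_moments(weight, theta: float, x_max: int = 200):
    """Mean and variance of Y with p.f. w(x) f_X(x) / E[w(X)], X ~ Poisson(theta)."""
    raw = [weight(x) * poisson_pmf(x, theta) for x in range(x_max + 1)]
    norm = sum(raw)                      # truncated value of E[w(X)]
    probs = [v / norm for v in raw]
    mean = sum(x * q for x, q in enumerate(probs))
    var = sum((x - mean) ** 2 * q for x, q in enumerate(probs))
    return mean, var

if __name__ == "__main__":
    theta, beta = 2.0, 1.5
    print("size-biased, w(x) = x:   ", weighted_moments(lambda x: x, theta))
    print("geometric, w(x) = beta^x:", weighted_moments(lambda x: beta ** x, theta))
```

Passing a different callable as `weight` gives the moments under any other reporting-bias assumption one may wish to try.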

4. Looking closer into the case of heterogeneity

Assuming a specific form for the distribution of the population that generated a data set implies a fixed mean-to-variance relationship, e.g. a mean-to-variance ratio equal to unity for the Poisson distribution. As is apparent from the above, however, this relationship rarely holds in real data sets. Flexible families have therefore been sought in the literature by allowing the parameter θ of the original distribution to vary according to a distribution with probability density function, say g(⋅).

As mentioned before, a density function $f_X(\cdot)$ is a mixture on the parameter θ of the density function f(⋅ ; θ) with some mixing distribution $G_\theta(\cdot)$, which can be continuous, discrete or a finite step distribution, if it can be written in the form $f_X(x) = E_G(f(x; \theta)) = \int_{\Theta} f(x; \theta)\,dG(\theta)$, where Θ is the parameter space. An appropriate choice of a mixing distribution allows its parameter to vary and acts as a means of “loosening” the structure of the initial model, thus offering more realistic interpretations of the mechanisms that generated the data.

A large number of Poisson mixtures have been developed. (For an extensive review, see Karlis and Xekalaki [2003], [2005]). The derivation of the negative binomial distribution as a mixture of the Poisson distribution with a gamma distribution as the mixing distribution, originally obtained by Greenwood and Yule ([1920]), constitutes a typical example. Mixtures of the negative binomial distribution have also been widely used in connection with applications in a plethora of fields. These include the Yule distribution (Yule [1924]; Irwin [1941]; Xekalaki [1983c], [1984b]), the Waring distribution (Irwin [1963]) and the generalized Waring distribution (Irwin [1968], [1975]; Xekalaki [1981], [1983a], [1984a]), which contains the Yule distribution and the Waring distribution as special cases.

In what follows, we focus on the generalized Waring distribution and its relevance in accident data modeling contexts.

5. The generalized Waring distribution

This distribution was introduced by Irwin ([1968]) in connection with biological data and was later shown by him to arise as an accident distribution (Irwin [1975]). It is the distribution with probability generating function given by

$$G(s) = \frac{\rho_{(k)}}{(a + \rho)_{(k)}}\, {}_2F_1(a, k; a + k + \rho; s), \quad a, k, \rho > 0,$$

with ${}_2F_1(a, b; c; z)$ denoting the Gauss hypergeometric function $\sum_{r=0}^{\infty} \frac{a_{(r)} b_{(r)}}{c_{(r)}} \frac{z^{r}}{r!}$, where $h_{(l)} = \Gamma(h + l)/\Gamma(h)$, h > 0, l ∈ R.

Irwin’s starting point was Waring’s expansion (hence the distribution’s name) given by $\frac{1}{x - a} = \sum_{r=0}^{\infty} \frac{a_{(r)}}{x_{(r+1)}}$, which he then generalized to $\frac{1}{(x - a)_{(k)}} = \sum_{r=0}^{\infty} \frac{a_{(r)} k_{(r)}}{x_{(k+r)}} \frac{1}{r!}$, a, k > 0.

Hence, by multiplying both sides by $\rho_{(k)}$, where ρ = x − a > 0, the successive terms of the resulting series could be regarded as defining a probability function, which he termed the generalized Waring distribution with parameters a, k, ρ. In particular, the probability function of the generalized Waring distribution with parameters a, k, ρ is given by

$$p_r = \frac{\rho_{(k)}}{(a + \rho)_{(k)}} \frac{a_{(r)} k_{(r)}}{(a + k + \rho)_{(r)}} \frac{1}{r!}, \quad a, k, \rho > 0, \; r = 0, 1, 2, \dots$$

where $h_{(l)} = \Gamma(h + l)/\Gamma(h)$.
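The probability function above is straightforward to evaluate with log-gamma functions. The sketch below is an illustrative implementation (the function names and the truncation point `r_max` are assumptions, not from the source). It checks that the probabilities sum to one and compares the numerically computed mean with ak/(ρ − 1), the known mean of the distribution for ρ > 1.

```python
import math

def log_ascending(h: float, l: float) -> float:
    """log of the ascending factorial h_(l) = Gamma(h + l) / Gamma(h)."""
    return math.lgamma(h + l) - math.lgamma(h)

def ugwd_pmf(r: int, a: float, k: float, rho: float) -> float:
    """Probability function of the generalized Waring distribution UGWD(a, k; rho)."""
    log_p = (log_ascending(rho, k) - log_ascending(a + rho, k)
             + log_ascending(a, r) + log_ascending(k, r)
             - log_ascending(a + k + rho, r) - math.lgamma(r + 1))
    return math.exp(log_p)

def ugwd_mass_and_mean(a: float, k: float, rho: float, r_max: int = 5000):
    """Truncated total mass and mean; the tail decays like r^-(rho + 1)."""
    probs = [ugwd_pmf(r, a, k, rho) for r in range(r_max + 1)]
    return sum(probs), sum(r * q for r, q in enumerate(probs))

if __name__ == "__main__":
    a, k, rho = 2.0, 3.0, 6.0
    total, mean = ugwd_mass_and_mean(a, k, rho)
    print(total, mean, a * k / (rho - 1.0))
```

Because of the heavy tail, a small ρ requires a much larger truncation point for the truncated sums to be accurate.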

Notwithstanding the complexity of its structure, this distribution was shown to offer an insightful tool in the interpretation of accident data, as will be seen below. Among its aspects that can be of practical value is that, as shown by Xekalaki ([1983b]), it is a discrete self-decomposable distribution in Steutel and van Harn’s ([1979]) sense, hence infinitely divisible, implying that its probability generating function can be put in the form $G(s) = \exp\left\{-\lambda \int_{s}^{1} \frac{1 - g(u)}{1 - u}\,du\right\}$, where $\lambda = p_1/p_0$ and g(⋅) denotes the probability generating function of the distribution with probability function satisfying the recurrence relation

q n =λ n ak + ρ a + k + ρ ak a + k + ρ + n j = 0 n 1 q j n j a + k + ρ + n 1 j / a + n 1 j k + n 1 j

6. The generalized Waring distribution in relation to accident theory

The hypotheses that have formed the basis of investigations into the occurrence of accidents for almost a century are

(i) Pure chance, giving rise to the Poisson distribution.

(ii) True contagion, i.e. the hypothesis that initially all individuals have the same probability of incurring an accident but that this probability is modified by each accident sustained.

(iii) Apparent contagion (heterogeneity), i.e. the hypothesis that individuals have constant but unequal probabilities of having an accident, the resultant distribution being a compound Poisson distribution (“accident proneness” model).

(iv) The “Spells” model, i.e. each person is liable to periods of time during which the person’s performance is weak (spells). All of the person’s accidents occur within those spells. The numbers of accidents within different spells are independent and independent of the number of spells.

As already seen, the negative binomial distribution can be given both an accident proneness and a “spells” interpretation in the context of accident theory, in terms of a gamma mixed Poisson distribution and a Poisson distribution generalized by a logarithmic distribution, respectively (Kemp [1967]).

Therefore, a good fit of the negative binomial is no help at all in distinguishing among the “proneness”, “contagion” and “spells” hypotheses. This is known as the discrimination problem between the compounded, contagion and generalized models for the negative binomial distribution and has been discussed by Arbous and Kerrich ([1951]); Bates and Neyman ([1952]); Gurland ([1959]) and Cane ([1974], [1977]). For an extensive bibliography on the accident hypotheses mentioned, see Kemp ([1970]).

6.1 Irwin’s “Proneness” model

As is evident, in all three of the above models, the data are treated as if the individuals under observation were exposed to equal environmental risk, a fact criticized by Irwin ([1968]), who suggested a three-parameter distribution, which he called the “univariate generalized Waring distribution” (UGWD). He derived this distribution in a framework that allows separately for random factors, differences in the exposure of individuals to external risk of accident, and differences in proneness.

In particular, his model assumes a non homogeneous population with respect to personal and environmental attributes affecting the occurrence of accidents.

Let the distribution of the number, X, of accidents for individuals of equal proneness ν and of equal exposure to external risk of accident λ|ν (i.e. λ for given ν) have probability generating function

$$G_{X|\lambda}(s) = \exp\{(\lambda|\nu)(s - 1)\}$$

in a unit time interval (0, 1). If the distributions of λ|ν and ν in the population at risk can be described by the probability density functions (pdf)

$$\nu^{-k} \exp(-\lambda/\nu)\, \lambda^{k-1}/\Gamma(k), \quad \nu, k > 0$$

and

$$\frac{\Gamma(a + \rho)\, \nu^{a-1} (1 + \nu)^{-(a + \rho)}}{\Gamma(\rho)\Gamma(a)}, \quad a, \rho > 0$$

respectively, the pgf of the resulting distribution of accidents will be $\rho_{(k)}\, {}_2F_1(a, k; a + k + \rho; s)/(a + \rho)_{(k)}$, i.e. the univariate generalized Waring distribution with parameters a, k and ρ, which will be denoted by UGWD(a, k; ρ). Here, ${}_2F_1(a, b; c; z)$ denotes the Gauss hypergeometric function $\sum_{r=0}^{\infty} \frac{a_{(r)} b_{(r)}}{c_{(r)}} \frac{z^{r}}{r!}$, where $h_{(l)} = \Gamma(h + l)/\Gamma(h)$, h > 0, l ∈ R. For more information about the UGWD the reader is referred to the work of Irwin ([1963], [1968], [1975]); Xekalaki ([1981]) and the references therein and Xekalaki ([1983a]).
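As a worked outline of how the two mixing steps combine (a sketch based on the densities stated above, not a reproduction of Irwin's own derivation), the Poisson pgf can first be averaged over λ for given ν and the result then averaged over ν. The final line records the implied mean, which requires ρ > 1.

```latex
% Step 1: average the Poisson pgf over lambda | nu (gamma density, index k, scale nu)
\int_0^\infty e^{\lambda(s-1)}\,
  \frac{\nu^{-k} e^{-\lambda/\nu} \lambda^{k-1}}{\Gamma(k)}\, d\lambda
  = \nu^{-k}\Bigl(\tfrac{1}{\nu} + 1 - s\Bigr)^{-k}
  = \{1 + \nu(1-s)\}^{-k}

% Step 2: average the resulting negative binomial pgf over nu
G_X(s) = \int_0^\infty \{1 + \nu(1-s)\}^{-k}\,
  \frac{\Gamma(a+\rho)}{\Gamma(\rho)\Gamma(a)}\,
  \nu^{a-1}(1+\nu)^{-(a+\rho)}\, d\nu
  = \frac{\rho_{(k)}}{(a+\rho)_{(k)}}\, {}_2F_1(a, k; a+k+\rho; s)

% Mean, by iterated expectation
E(X) = E(\lambda) = k\,E(\nu) = \frac{ak}{\rho - 1}, \qquad \rho > 1
```

Step 1 shows that, for fixed proneness ν, the accident distribution is negative binomial; the generalized Waring form arises only after ν is allowed to vary across the population.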

6.2 The “Contagion” model

Xekalaki ([1983a]) extended the assumptions of the classical contagion model developed by Greenwood and Yule ([1920]) by considering a population of individuals exposed to varying accident risk.

In particular, assume that at time t = 0 none of the individuals has had an accident. This would be true, for example, for a population of new drivers or of individuals just beginning a new type of work. Suppose that during the time period from t to t + dt a person with x accidents by time t can incur another accident with a probability of {(k + x)/(1 + λt)}λdt (independent of the times of the previous accidents), where k is a positive constant and λ refers to the individual’s risk exposure. At t = 0, since x = 0, the probability of an accident is kλdt. Hence, what the model basically assumes is that, initially, the probability of having an accident is not the same for each individual, but depends on the external conditions; later, the probability is also affected by the number of preceding accidents. Under these assumptions, and if differences in the exposure to accident risk can be thought of as governed by a distribution with probability density function given by $\Gamma(a + \rho)\nu^{a-1}(1 + \nu)^{-(a + \rho)}/\{\Gamma(\rho)\Gamma(a)\}$, the final distribution of accidents over a unit period of time turns out to be UGWD(a, k; ρ).

The above derivation of the generalized Waring distribution closely relates to a modeling approach whereby the distribution of accident occurrences in a time interval (0, t) is regarded as underpinned by a stochastic process and, in particular, by a pure birth process $\{X_t, t = 0, 1, 2, \dots\}$, where the probability of a person incurring an accident in (t, t + δt), having had x accidents by time t, is $P(X_{t+\delta t} = x + 1 \mid X_t = x) = f_\lambda(x, t)\delta t + o(\delta t)$.

Irwin ([1941]), followed later by Arbous and Kerrich ([1951]), derived the negative binomial distribution on this hypothesis, solving the associated Kolmogorov forward differential equations by a method due to McKendrick ([1925]). Specifically, assuming that during the time period from t to t + dt an individual can have 0 accidents with probability $1 - f_\lambda(x, t)dt$, 1 accident with probability $f_\lambda(x, t)dt$ and more than 1 accident with probability 0, he solved the resulting system of Kolmogorov forward difference-differential equations

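$$\frac{\partial}{\partial t} P_\lambda(0, t) = -f_\lambda(0, t) P_\lambda(0, t),$$
$$\frac{\partial}{\partial t} P_\lambda(x, t) = -f_\lambda(x, t) P_\lambda(x, t) + f_\lambda(x - 1, t) P_\lambda(x - 1, t), \quad x \geq 1,$$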

in terms of a single difference-differential equation involving the probability generating function G λ (s; t) of X t given by

$$\frac{\partial}{\partial t} G_\lambda(s; t) = (s - 1) \sum_{x=0}^{\infty} s^{x} f_\lambda(x, t) P_\lambda(x, t),$$

where $G_\lambda(s; t) = \sum_{x=0}^{\infty} P_\lambda(x, t) s^{x}$. (He obtained this equation by multiplying the i-th equation of the system by $s^{i-1}$, i = 1, 2, … and summing the resulting equations).

Assuming further that $f_\lambda(x, t) = \lambda(k + mx)$, k, m > 0, and subject to the initial conditions $G_\lambda(1; t) = G_\lambda(s; 0) = 1$, he obtained for the distribution of accidents

$$G_\lambda(s; t) = \{e^{\lambda m t} - s(e^{\lambda m t} - 1)\}^{-k/m},$$

i.e. the probability generating function of the negative binomial distribution with parameters k/m and $(1 - e^{-\lambda m t})^{-1}$.
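The link between the forward equations and the negative binomial solution can be checked numerically. The sketch below is an illustration only: it integrates the system with $f_\lambda(x, t) = \lambda(k + mx)$ by a classical fourth-order Runge-Kutta scheme on a truncated state space, and compares the result with the negative binomial probability function of index k/m and success probability $e^{-\lambda m t}$. The function names, the truncation point `x_max` and the step count are assumptions made for this example.

```python
import math

def forward_rhs(P, lam, k, m):
    """Right-hand side of the forward equations with rate f(x) = lam * (k + m * x)."""
    rates = [lam * (k + m * x) for x in range(len(P))]
    d = [-rates[0] * P[0]]
    for x in range(1, len(P)):
        d.append(-rates[x] * P[x] + rates[x - 1] * P[x - 1])
    return d

def solve_forward(lam, k, m, t, x_max=80, steps=2000):
    """Integrate P(x, t) from P(0, 0) = 1 using classical RK4 steps."""
    P = [1.0] + [0.0] * x_max
    h = t / steps
    for _ in range(steps):
        k1 = forward_rhs(P, lam, k, m)
        k2 = forward_rhs([p + 0.5 * h * d for p, d in zip(P, k1)], lam, k, m)
        k3 = forward_rhs([p + 0.5 * h * d for p, d in zip(P, k2)], lam, k, m)
        k4 = forward_rhs([p + h * d for p, d in zip(P, k3)], lam, k, m)
        P = [p + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
             for p, a, b, c, d in zip(P, k1, k2, k3, k4)]
    return P

def nb_pmf(x, r, p):
    """Negative binomial p.f. with index r and success probability p."""
    return math.exp(math.lgamma(r + x) - math.lgamma(r) - math.lgamma(x + 1)
                    + r * math.log(p) + x * math.log(1.0 - p))

if __name__ == "__main__":
    lam, k, m, t = 1.0, 1.0, 0.5, 1.0
    P = solve_forward(lam, k, m, t)
    for x in range(5):
        # Numerical solution vs closed-form negative binomial probability.
        print(x, P[x], nb_pmf(x, k / m, math.exp(-lam * m * t)))
```

Replacing the rate inside `forward_rhs` with another function of x gives a quick way to explore the count distributions generated by other birth rates.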

Relaxing Irwin’s implicit assumption that all individuals were exposed to the same accident risk, Xekalaki ([1981]) treated the parameter λ as referring to a variable risk exposure according to an exponential distribution with density $a e^{-a\lambda}$, a > 0, and obtained the generalized Waring distribution as the accident distribution. In particular,

G X t s = a 0 e e λmt s e λmt 1 k / m = a 1 s k / m mt 0 e λ a + kt mt 1 s s 1 e λ k / m = a 1 s k / m mt Γ a + kt / mt Γ 1 + a + kt / mt F 1 2 k m , a + kt mt ; a + kt mt + 1 ; s s 1 = a a + mt F 1 2 k m , 1 ; a mt + k m + 1 ; s

which is the probability generating function of the UGWD k m , 1 ; a mt .
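The mixing step can also be verified numerically by integrating the negative binomial probabilities against the exponential density and comparing the result with the probabilities of the UGWD(k/m, 1; a/(mt)). The following is a minimal sketch under stated assumptions: the function names, the composite Simpson rule, the truncation of the integral at `upper` and the parameter values are illustrative choices and are not taken from Xekalaki ([1981]).

```python
import math

def log_rising(h, l):
    """log of the ascending factorial h_(l) = Gamma(h + l) / Gamma(h)."""
    return math.lgamma(h + l) - math.lgamma(h)

def ugwd_pmf(x, a, k, rho):
    """UGWD(a, k; rho): rho_(k)/(a+rho)_(k) * a_(x) k_(x) / {(a+k+rho)_(x) x!}."""
    return math.exp(log_rising(rho, k) - log_rising(a + rho, k)
                    + log_rising(a, x) + log_rising(k, x)
                    - log_rising(a + k + rho, x) - math.lgamma(x + 1))

def negative_binomial_pmf(x, shape, p0):
    """Negative binomial with pgf {p0 / (1 - (1 - p0) s)}^shape."""
    coef = math.exp(log_rising(shape, x) - math.lgamma(x + 1))
    return coef * p0 ** shape * (1.0 - p0) ** x

def exponential_mixture_pmf(x, a, k, m, t, upper=40.0, n=20000):
    """Simpson approximation of a * int_0^upper e^{-a lam} NB(x | lam) d lam (n even)."""
    h = upper / n
    total = 0.0
    for i in range(n + 1):
        lam = i * h
        weight = 1 if i in (0, n) else (4 if i % 2 else 2)
        p0 = math.exp(-lam * m * t)
        total += weight * a * math.exp(-a * lam) * negative_binomial_pmf(x, k / m, p0)
    return total * h / 3.0

a, k, m, t = 2.0, 3.0, 1.0, 1.5
for x in range(4):
    print(x, exponential_mixture_pmf(x, a, k, m, t), ugwd_pmf(x, k / m, 1.0, a / (m * t)))
```

The two printed columns should agree up to the quadrature error, which is small here because the integrand is smooth and decays exponentially.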

This model was considered by Panaretos ([1989]) for the description of the evolution of surnames. Faddy ([1997]) provided a unifying approach to under- and over-dispersion relative to the Poisson distribution within a scheme of a similar nature, which generalizes the simple Poisson process that underpins the Poisson distribution. He demonstrated that any count distribution can be obtained by a suitable choice of $f_\lambda(x, t)$ and provided an expression for the system of Kolmogorov forward differential equations in terms of a matrix-exponential function.

Finally, Winkelmann ([1995]) looked at under- and over-dispersion using renewal theory by exploring the link between duration dependence and dispersion. He demonstrated that discrepancies between observed and nominal variances are conveyed by a hazard function of the waiting times that is not constant, but instead is a decreasing function of time inducing over-dispersion or an increasing function of time inducing under-dispersion.
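The link between the shape of the hazard and the direction of the dispersion can be illustrated by simulation. The sketch below assumes gamma distributed waiting times with unit mean, whose hazard is decreasing for shape below 1, constant for shape 1 and increasing for shape above 1; this choice, the function names, the horizon and the number of replications are assumptions made for illustration and are not a reproduction of Winkelmann's analysis.

```python
import random

def renewal_count(shape, horizon, rng):
    """Number of renewals in (0, horizon] with gamma(shape) waiting times of mean 1."""
    elapsed, count = 0.0, 0
    while True:
        elapsed += rng.gammavariate(shape, 1.0 / shape)  # second argument is the scale
        if elapsed > horizon:
            return count
        count += 1

def dispersion_index(shape, horizon=20.0, reps=4000, seed=1):
    """Sample variance-to-mean ratio of the simulated renewal counts."""
    rng = random.Random(seed)
    counts = [renewal_count(shape, horizon, rng) for _ in range(reps)]
    mean = sum(counts) / reps
    var = sum((c - mean) ** 2 for c in counts) / (reps - 1)
    return var / mean

for shape in (0.5, 1.0, 2.0):
    print(shape, dispersion_index(shape))
```

With these settings the index is expected to exceed 1 for shape 0.5 (decreasing hazard), to be close to 1 for shape 1 (the Poisson case) and to fall below 1 for shape 2 (increasing hazard).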

6.3 The “Spells” model

Further, Xekalaki ([1983a]) considered a variant of the “spells” model due to Cresswell and Froggatt ([1963]) that rejects the presence of proneness and contagion.

Assume that every individual is liable to spells and that the number of spells in a given time period (0, t) is a Poisson variable with parameter θt, θ > 0. Suppose that no accidents occur outside spells and that the probability of an accident within a spell depends on the risk exposure of the particular individual. In particular, suppose that within a spell a person can have

$$\begin{cases} 0 \text{ accidents} & \text{with probability } 1 - m \log(1 + \lambda), \\ n \text{ accidents } (n \ge 1) & \text{with probability } m \{\lambda/(1 + \lambda)\}^{n} / n, \end{cases}$$

where $0 < m < 1/\log(1 + \lambda)$, λ > 0, and λ is the external risk parameter for the given individual. Assume further that the numbers of accidents arising out of different spells are independent and independent of the number of spells. Then, if differences in the risk exposure can be described by a beta distribution of the second kind with probability density function $\{\Gamma(a + \rho)\,\nu^{a-1}(1 + \nu)^{-(a+\rho)}\}/\{\Gamma(\rho)\Gamma(a)\}$, a, ρ > 0, the resulting accident distribution will have probability generating function given by

$$\frac{\rho_{(a)}}{(\rho + \theta m t)_{(a)}}\; {}_2F_1(a, \theta m t; a + \theta m t + \rho; s).$$

Hence, in a unit time period, the number of accidents follows the UGWD(a, θm; ρ).
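The intermediate step behind this result is that, for fixed λ, the compound Poisson sum of within-spell accident counts has probability generating function $\{1 + \lambda(1 - s)\}^{-\theta m t}$, i.e. it is negative binomial with index θmt and parameter q = λ/(1 + λ). This can be checked numerically with the Panjer recursion for compound Poisson distributions. The sketch below is an illustration under assumed function names and parameter values; the recursion is a standard device and is not claimed to be the method used in the cited work.

```python
import math

def spell_pmf(n, m, lam):
    """Within-spell accident distribution for a given risk parameter lam."""
    q = lam / (1.0 + lam)
    return 1.0 - m * math.log1p(lam) if n == 0 else m * q ** n / n

def compound_poisson_pmf(n_max, theta, t, m, lam):
    """Panjer recursion for X_t = Y_1 + ... + Y_{N_t}, with N_t ~ Poisson(theta * t)."""
    rate = theta * t
    p = [math.exp(-rate * (1.0 - spell_pmf(0, m, lam)))]
    for n in range(1, n_max + 1):
        p.append(rate / n * sum(j * spell_pmf(j, m, lam) * p[n - j]
                                for j in range(1, n + 1)))
    return p

def negative_binomial_pmf(n, shape, q):
    """Negative binomial with pgf {(1 - q) / (1 - q s)}^shape."""
    log_coef = math.lgamma(shape + n) - math.lgamma(shape) - math.lgamma(n + 1)
    return math.exp(log_coef + shape * math.log1p(-q) + n * math.log(q))

theta, t, m, lam = 2.0, 1.0, 0.5, 1.5   # requires m < 1 / log(1 + lam)
recursive = compound_poisson_pmf(15, theta, t, m, lam)
direct = [negative_binomial_pmf(n, theta * m * t, lam / (1.0 + lam)) for n in range(16)]
print(max(abs(r - d) for r, d in zip(recursive, direct)))
```

Mixing this negative binomial over the beta distribution of the second kind for λ then yields the generalized Waring form given above.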

It is worth noticing that the form of the distribution of λ in the last two models is more general than that considered by the proneness model. It is, however, a reasonable choice, as it implies a beta distribution of the first kind (Pearson Type I) for the parameter q = λ/(1 + λ) of the negative binomial distribution of X|λ.

6.4 Deciding about the underlying model

It is evident from the above that three completely different sets of hypotheses give rise to exactly the same form of distribution and that, while the UGWD may be a plausible model if accident proneness is accepted as an established fact, a satisfactory fit of it is not to be taken as evidence for the validity of the proneness hypothesis. How can we then discriminate?

Statisticians have always been excited to look for ways of discriminating among different models that give rise to the same distribution. Most attempts seem to have been concentrated on distinguishing between the proneness and contagion models generating the negative binomial distribution. The papers by Bates and Neyman ([1952]) and Bates ([1955]) cover part of the work that has been done on the subject, though they primarily focus on distinguishing between different forms of contagion. Shaw and Sichel’s ([1971]) attempt was on proving or disproving proneness by ranking individual accident performance on a scale based on their average interval between successive accidents. However, the first systematic study on how one can discriminate between the proneness and contagion models of the negative binomial distribution appears to be that by Cane ([1974]).

She demonstrated, however, that one cannot distinguish between the two models, even with knowledge of the time sequence of accidents. In particular, she showed that the conditional distribution of the times $t_i$, i = 1, 2, …, n, at which accidents occurred in a time period (0, T) is the same in both cases, namely that of an ordered sample from a uniform distribution over (0, T), with probability density function $n!\,T^{-n}$. In fact, this is the case for any compound Poisson accident distribution whose compounding distribution has finite moments (Cane [1977]), hence also for the UGWD(a, k; ρ).

This implies that the availability of information on the times of the occurrence of accidents is not sufficient to guide one’s choice between the proneness and contagion models.

However, as demonstrated by Xekalaki ([1983a]), there appears to exist a possibility in the framework of the spells model. Consider, in particular, the problem of finding the joint distribution of the times $t_i$, i = 1, 2, …, n, of accidents incurred by individuals with n accidents in a unit period of time under the spells model. For fixed λ, accidents occur as events in a generalized Poisson process:

$$X_t = \sum_{i=1}^{N_t} Y_i, \qquad N_t \sim \text{Poisson}(\theta t),$$

where θ > 0, t ≥ 0 and the $Y_i$ are independently and identically distributed according to the within-spell accident distribution of Section 6.3, with λ varying according to the beta distribution of the second kind with probability density function $\{\Gamma(a + \rho)\,\lambda^{a-1}(1 + \lambda)^{-(a+\rho)}\}/\{\Gamma(\rho)\Gamma(a)\}$, a, ρ > 0. Consequently, the required probability function can be written as

$$\int_0^{\infty} (1 + \lambda)^{-\theta m (1 - t_n)} \prod_{i=1}^{n} \left\{ \lambda m \theta\, (1 + \lambda)^{-\theta m (t_i - t_{i-1}) - 1}\, dt_i \right\} dH(\lambda),$$

with $t_0 = 0$ and H(⋅) denoting the distribution function of the beta distribution of the second kind defined as above. Hence, the required probability is

$$(\theta m)^{n}\, \frac{\rho_{(a)}\, a_{(n)}}{(\theta m + \rho)_{(a+n)}}\, dt_1 \cdots dt_n.$$

Therefore, conditional on n accidents during a time period from 0 to 1, the joint pdf of $t_i$, i = 1, 2, …, n, is $n!\,(\theta m)^{n}/(\theta m)_{(n)}$.
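To see how this constant compares with the density $n!\,T^{-n}$ obtained by Cane for T = 1, both can be evaluated for a few values of n. The helper names and the value of θm below are hypothetical and serve only to illustrate the comparison.

```python
import math

def spells_time_density(n, theta_m):
    """n! (theta m)^n / (theta m)_(n): joint density of accident times under the spells model."""
    log_rising = math.lgamma(theta_m + n) - math.lgamma(theta_m)
    return math.exp(math.lgamma(n + 1) + n * math.log(theta_m) - log_rising)

def uniform_order_density(n, T=1.0):
    """n! T^(-n): density of an ordered uniform sample on (0, T)."""
    return math.factorial(n) / T ** n

for n in (1, 2, 5):
    print(n, spells_time_density(n, theta_m=0.5), uniform_order_density(n))
```

The two expressions coincide for n = 1 and differ for n ≥ 2, since $(\theta m)_{(n)} = \theta m(\theta m + 1)\cdots(\theta m + n - 1)$ exceeds $(\theta m)^{n}$ whenever n ≥ 2.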

The obtained form differs from that arising under the proneness and contagion models. This fact is in itself very interesting as far as establishing the presence of spells is concerned, as it implies the following: if an observed accident distribution of the UGWD type has arisen from the spells model, the time intervals $(0, t_i)$, i = 1, 2, …, n, given a total of n accidents, will be jointly distributed with the above density function. Any departure from this distribution is, then, evidence against the spells model. Of course, if on the available evidence one has to reject this form in favor of that obtained by Cane, then one is faced again with the question: “proneness or contagion?” This cannot be answered by studying the distribution of $t_i$.

6.5 What does Irwin’s accident model offer beyond a good fit to the data?

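The innovation brought by Irwin’s accident proneness model does not merely lie in the better fit it provides to accident data, but in the possibility of partitioning the total variance ($\sigma^2$) into three additive components due to proneness ($\sigma_\nu^2$), liability ($\sigma_\lambda^2$) and randomness ($\sigma_R^2$), thus,

$$\sigma^2 = \sigma_\lambda^2 + k^2 \sigma_\nu^2 + \sigma_R^2,$$

where

$$\begin{aligned} \sigma_\lambda^2 &= a k (a + 1)(\rho - 1)^{-1}(\rho - 2)^{-1}, \\ \sigma_\nu^2 &= a (a + \rho - 1)(\rho - 1)^{-2}(\rho - 2)^{-1}, \\ \sigma_R^2 &= a k (\rho - 1)^{-1}, \\ \sigma^2 &= a k (a + \rho - 1)(k + \rho - 1)(\rho - 1)^{-2}(\rho - 2)^{-1}. \end{aligned}$$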

There is still, however, a problem due to the fact that the UGWD(a, k; ρ) is symmetrical in a and k (UGWD(a, k; ρ) ∼ UGWD(k, a; ρ)). Hence, although one may consider that $\sigma_\lambda^2 + k^2 \sigma_\nu^2$ represents the variance component due to all non-random factors, the mathematics alone cannot determine whether $\sigma_\lambda^2$ represents the liability component and $k^2 \sigma_\nu^2$ the proneness component or vice versa. As a consequence, distinguishable estimates for the non-random variance components $\sigma_\lambda^2$ and $\sigma_\nu^2$ cannot be obtained unless subjective judgement is made. This problem was addressed by Xekalaki ([1984a]) with the introduction of her bivariate form of the generalized Waring distribution.
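Both the additivity of the components and the identifiability problem can be made concrete with a small calculation. In the sketch below the function name and the parameter values are illustrative assumptions; the formulas are those of the variance partition given above and require ρ > 2.

```python
def ugwd_variance_components(a, k, rho):
    """Return (liability, proneness, randomness, total) for the UGWD(a, k; rho)."""
    if rho <= 2:
        raise ValueError("the variance is finite only for rho > 2")
    liability = a * k * (a + 1) / ((rho - 1) * (rho - 2))
    proneness = a * (a + rho - 1) / ((rho - 1) ** 2 * (rho - 2))
    randomness = a * k / (rho - 1)
    total = a * k * (a + rho - 1) * (k + rho - 1) / ((rho - 1) ** 2 * (rho - 2))
    return liability, proneness, randomness, total

# Original labelling of the parameters.
lia, pro, ran, tot = ugwd_variance_components(a=2.0, k=1.5, rho=6.0)
print(lia + 1.5 ** 2 * pro + ran, tot)

# Swapping the roles of a and k leaves the total unchanged but not its split.
lia_s, pro_s, ran_s, tot_s = ugwd_variance_components(a=1.5, k=2.0, rho=6.0)
print(tot_s, lia_s, 2.0 ** 2 * pro_s)
```

The total and the random component are invariant under the exchange of a and k, whereas the two non-random components change, which is why a univariate fit alone cannot separate them.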

7. The bivariate generalized Waring distribution

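Generalizing further Irwin’s ([1963]) generalization of Waring’s expansion, we have for k, m, a > 0,

$$\begin{aligned} \frac{1}{(x - a)_{(k+m)}} &= \sum_{\ell=0}^{\infty} \sum_{r=0}^{\ell} \frac{a_{(\ell)} (-1)^{\ell}}{\ell!} \binom{\ell}{r}\, \Delta^{r} \frac{1}{x_{(k)}}\; \Delta^{\ell - r} \frac{1}{(x + k + r)_{(m)}} \\ &= \sum_{r=0}^{\infty} \sum_{\ell=0}^{\infty} \frac{a_{(r+\ell)} (-1)^{r+\ell}}{r!\, \ell!}\, \Delta^{r} \frac{1}{x_{(k)}}\; \Delta^{\ell} \frac{1}{(x + k + r)_{(m)}} \\ &= \sum_{r=0}^{\infty} \sum_{\ell=0}^{\infty} \frac{a_{(r+\ell)}\, k_{(r)}\, m_{(\ell)}}{x_{(k+m+r+\ell)}}\, \frac{1}{r!\, \ell!}. \end{aligned}$$

If x > a, the above series is convergent. Letting ρ = x − a > 0 and multiplying both sides by $\rho_{(k+m)}$ then leads to a double series of positive terms converging to unity. The general term of the series can therefore be regarded as defining a bivariate discrete probability distribution with probability function

$$p_{r,\ell} = \frac{\rho_{(k+m)}}{(a + \rho)_{(k+m)}}\, \frac{a_{(r+\ell)}\, k_{(r)}\, m_{(\ell)}}{(a + k + m + \rho)_{(r+\ell)}}\, \frac{1}{r!\, \ell!}, \qquad a, k, m, \rho > 0; \quad r, \ell = 0, 1, 2, \ldots$$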

In the remainder of the paper, we refer to this distribution as the bivariate generalized Waring distribution with parameters a, k, m and ρ and we denote it by BGWD(a; k, m; ρ).
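That the terms $p_{r,\ell}$ sum to unity can be checked numerically on a truncated grid. The sketch below is a minimal illustration: the function names, the truncation point and the parameter values are assumptions, and ρ is chosen fairly large so that the neglected tail mass is negligible.

```python
import math

def log_rising(h, l):
    """log of the ascending factorial h_(l) = Gamma(h + l) / Gamma(h)."""
    return math.lgamma(h + l) - math.lgamma(h)

def bgwd_pmf(r, l, a, k, m, rho):
    """Probability function of the BGWD(a; k, m; rho) at (r, l)."""
    return math.exp(log_rising(rho, k + m) - log_rising(a + rho, k + m)
                    + log_rising(a, r + l) + log_rising(k, r) + log_rising(m, l)
                    - log_rising(a + k + m + rho, r + l)
                    - math.lgamma(r + 1) - math.lgamma(l + 1))

def truncated_total(a, k, m, rho, n_max=200):
    """Sum of the probabilities over the grid 0 <= r, l < n_max."""
    return sum(bgwd_pmf(r, l, a, k, m, rho)
               for r in range(n_max) for l in range(n_max))

print(truncated_total(a=2.0, k=1.5, m=1.0, rho=6.0))
```

For smaller values of ρ the tails are heavier and a larger truncation point would be needed for the same accuracy.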

7.1 The BGWD in relation to accident theory

Assume that individuals of proneness ν and liability $\lambda_i \mid \nu$ for a period i of observation incur, over two non-overlapping time periods, accidents X, Y according to a double Poisson distribution

$$G_{X,Y \mid \lambda_1, \lambda_2, \nu}(s, t) = \exp\{(\lambda_1 \mid \nu)(s - 1) + (\lambda_2 \mid \nu)(t - 1)\}, \qquad \lambda_1, \lambda_2 > 0.$$

Assume further that the liability parameters $\lambda_1 \mid \nu$, $\lambda_2 \mid \nu$ are independently gamma distributed with densities $\{\Gamma(\theta_i)\,\nu^{\theta_i}\}^{-1} e^{-\lambda_i/\nu} \lambda_i^{\theta_i - 1}$, $\theta_1 \equiv k$, $\theta_2 \equiv m$, ν > 0, whence for individuals with the same proneness ν, but varying liabilities, the numbers of occurring accidents over the two periods are jointly distributed as the double negative binomial with probability generating function

$$G_{X,Y \mid \nu}(s, t) = \{1 + \nu(1 - s)\}^{-k} \{1 + \nu(1 - t)\}^{-m}.$$

Letting now the proneness parameter ν follow a beta distribution of the second kind with density function $\{\Gamma(a + \rho)\,\nu^{a-1}(1 + \nu)^{-(a+\rho)}\}/\{\Gamma(\rho)\Gamma(a)\}$, a, ρ > 0, the probability generating function of the joint distribution of accidents over the two periods takes the form

$$\begin{aligned} G_{X,Y}(s, t) &= \frac{\Gamma(\rho + a)}{\Gamma(\rho)\Gamma(a)} \int_0^{+\infty} \nu^{a-1} (1 + \nu)^{-(a+\rho)} \{1 + \nu(1 - s)\}^{-k} \{1 + \nu(1 - t)\}^{-m}\, d\nu \\ &= \frac{\rho_{(k+m)}}{(a + \rho)_{(k+m)}}\, F_1(a; k, m; a + k + m + \rho; s, t) \sim \text{BGWD}(a; k, m; \rho), \end{aligned}$$

where $F_1(a; b, c; d; u, v) = \sum_{r,s=0}^{\infty} a_{(r+s)}\, b_{(r)}\, c_{(s)}\, u^{r} v^{s} / \{d_{(r+s)}\, r!\, s!\}$ is Appell’s hypergeometric series and $h_{(l)} = \Gamma(h + l)/\Gamma(h)$, h > 0, l ∈ R.
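The three-stage mechanism just described (beta proneness, gamma liabilities, Poisson accident counts) is straightforward to simulate. The following sketch assumes the hypothetical helper names shown, generates the beta variate of the second kind as B/(1 − B) with B a beta variate of the first kind, and draws Poisson counts by counting unit-rate exponential arrivals, since the Python standard library has no Poisson generator. The parameter values are illustrative.

```python
import random

def poisson_variate(mean, rng):
    """Poisson draw: number of unit-rate exponential arrivals falling in (0, mean]."""
    count, arrival = 0, rng.expovariate(1.0)
    while arrival <= mean:
        count += 1
        arrival += rng.expovariate(1.0)
    return count

def simulate_accident_pairs(a, k, m, rho, size, seed=0):
    """Draw (X, Y) pairs from the proneness-liability model leading to BGWD(a; k, m; rho)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(size):
        b = rng.betavariate(a, rho)
        nu = b / (1.0 - b)                 # proneness: beta of the second kind
        lam1 = rng.gammavariate(k, nu)     # liability in period 1 (shape k, scale nu)
        lam2 = rng.gammavariate(m, nu)     # liability in period 2 (shape m, scale nu)
        pairs.append((poisson_variate(lam1, rng), poisson_variate(lam2, rng)))
    return pairs

pairs = simulate_accident_pairs(a=2.0, k=1.5, m=1.0, rho=6.0, size=20000)
mean_x = sum(x for x, _ in pairs) / len(pairs)
mean_y = sum(y for _, y in pairs) / len(pairs)
print(mean_x, 2.0 * 1.5 / (6.0 - 1.0))   # sample mean of X against a k / (rho - 1)
print(mean_y, 2.0 * 1.0 / (6.0 - 1.0))   # sample mean of Y against a m / (rho - 1)
```

The sample means should lie close to ak/(ρ − 1) and am/(ρ − 1), the means of the UGWD marginals.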

Regarding separate estimation of the contribution of proneness, liability and randomness in a given accident situation over a period of observation whenever proneness is accepted as an established fact, Xekalaki ([1984a]) showed that rearranging the observed distribution in two non-overlapping sub-intervals and fitting the BGWD(a; k, m; ρ) to the resulting bivariate accident distribution does enable separate estimation of the variance components. This is demonstrated in Table 1.

Table 1 Estimators of the components of the variance of the generalized Waring distribution

Further models leading to the BGWD, provided by Xekalaki ([1984c]), supply the framework within which one can also obtain the BGWD as an accident distribution under the contagion and the spells accident theories.

8. The multivariate generalized Waring distribution

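The n-variate version of the generalized Waring distribution introduced and studied by Xekalaki ([1986]) is also obtained as an inverse factorial distribution. Its probability generating function is given by

$$G(\bar t) = \frac{\rho_{(\sum k_i)}}{(a + \rho)_{(\sum k_i)}}\, F_D\!\left(a; k_1, \ldots, k_n; a + \sum_{i=1}^{n} k_i + \rho; \bar t\right)$$

with $F_D(a; \beta_1, \ldots, \beta_n; \gamma; \bar t)$ denoting Lauricella’s hypergeometric function given by

$$F_D(a; \beta_1, \ldots, \beta_n; \gamma; \bar t) = \sum_{r_1, \ldots, r_n} \frac{a_{(\sum r_i)}}{\gamma_{(\sum r_i)}} \prod_{i=1}^{n} \frac{(\beta_i)_{(r_i)}\, t_i^{r_i}}{r_i!}.$$

Its probability function is given by

$$P_{\bar r} = P(\bar X = \bar r) = \frac{\rho_{(\sum k_i)}}{(a + \rho)_{(\sum k_i)}}\, \frac{a_{(\sum r_i)}\, (k_1)_{(r_1)} \cdots (k_n)_{(r_n)}}{(a + \rho + \sum k_i)_{(\sum r_i)}\; r_1! \cdots r_n!}, \qquad r_i = 0, 1, 2, \ldots;\ i = 1, \ldots, n,$$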

and its probabilities are related by the following first order recurrences, which facilitate their computation:

$$\frac{P(l_1, l_2, \ldots, l_{h-1}, l_h + 1, l_{h+1}, \ldots, l_n)}{P(l_1, l_2, \ldots, l_n)} = \frac{\left(a + \sum_{i=1}^{n} l_i\right)(k_h + l_h)}{\left(a + \rho + \sum_{i=1}^{n} k_i + \sum_{i=1}^{n} l_i\right)(l_h + 1)}, \qquad l_i = 0, 1, 2, \ldots;\ i, h = 1, 2, \ldots, n.$$
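The recurrence can be used to build any probability from P(0, …, 0) by incrementing one coordinate at a time. The sketch below compares this with a direct evaluation of the probability function; the function names and parameter values are illustrative assumptions, and the recurrence coded is the one obtained by taking the ratio of two adjacent terms of the probability function above.

```python
import math

def log_rising(h, l):
    """log of the ascending factorial h_(l) = Gamma(h + l) / Gamma(h)."""
    return math.lgamma(h + l) - math.lgamma(h)

def mgwd_pmf(r, a, ks, rho):
    """Direct evaluation of the n-variate generalized Waring probability function."""
    big_k, big_r = sum(ks), sum(r)
    log_p = (log_rising(rho, big_k) - log_rising(a + rho, big_k)
             + log_rising(a, big_r) - log_rising(a + rho + big_k, big_r))
    for k_i, r_i in zip(ks, r):
        log_p += log_rising(k_i, r_i) - math.lgamma(r_i + 1)
    return math.exp(log_p)

def mgwd_pmf_by_recurrence(r, a, ks, rho):
    """Same probability, built from P(0, ..., 0) with the first order recurrences."""
    big_k = sum(ks)
    p = math.exp(log_rising(rho, big_k) - log_rising(a + rho, big_k))  # P(0, ..., 0)
    current = [0] * len(ks)
    for h, target in enumerate(r):
        while current[h] < target:
            total = sum(current)
            p *= ((a + total) * (ks[h] + current[h])
                  / ((a + rho + big_k + total) * (current[h] + 1)))
            current[h] += 1
    return p

point, a, ks, rho = (2, 0, 3), 2.0, (1.5, 1.0, 0.5), 4.0
print(mgwd_pmf(point, a, ks, rho), mgwd_pmf_by_recurrence(point, a, ks, rho))
```

The recurrence avoids repeated gamma function evaluations, which is convenient when a whole table of probabilities is required.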

An interesting aspect of the bivariate and multivariate versions of the generalized Waring distribution is that their marginal distributions (conditional and unconditional) as well as their convolution are of the same form (UGWD’s), properties that exhibit a symmetry analogous to that existing in the case of the multivariate normal distribution. Further, the generalized Waring distribution is self-decomposable (Xekalaki [1983b]).

9. The Generalized Waring Process (gWp)

Looking into how temporally evolving data from the wide spectrum of application contexts that can reasonably be viewed from the perspective of the frameworks discussed in Sections 6, 7 and 8 can be treated, Xekalaki and Zografi ([2008]) defined and studied the generalized Waring process. In establishing its definition, the structural properties of both the bivariate and the multivariate versions of the generalized Waring distribution played a significant role. This process, analogously to the case of Poisson and Pólya processes, which can be obtained as limiting cases of it, was shown to be a Markov process.

Let {N(t), t ≥ 0} be a counting process. This is said to be a generalized Waring process with parameters a, k, ρ > 0, denoted by gWp(a, k; ρ), if (i) N(0) = 0, (ii) N(t) is a Markov process, and (iii) N(t + h) − N(t) has the generalized Waring distribution with parameters a, kh; ρ for h > 0, t ≥ 0. The process starts at 0, it has stationary increments and

$$P\{N(t) = n\} = \frac{\rho_{(kt)}}{(\rho + a)_{(kt)}}\, \frac{a_{(n)}\, (kt)_{(n)}}{(a + \rho + kt)_{(n)}}\, \frac{1}{n!},$$

i.e., N(t) has a generalized Waring distribution with parameters a, kt; ρ.

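The transition probabilities of the generalized Waring process are given by

$$\begin{aligned} p_{m,n}(s, s + t) &= P\{N(s + t) = n \mid N(s) = m\} = \frac{\Gamma(a + n)}{\Gamma(a + m)}\, \frac{(kt)_{(n-m)}}{(n - m)!}\, \frac{(\rho + ks)_{(a+m)}}{(\rho + ks + kt)_{(a+n)}}, \\ p_{0,n}(0, t) &= P\{N(t) = n \mid N(0) = 0\} = \frac{\rho_{(kt)}}{(\rho + a)_{(kt)}}\, \frac{a_{(n)}\, (kt)_{(n)}}{(a + \rho + kt)_{(n)}}\, \frac{1}{n!} = P\{N(t) = n\}, \end{aligned}$$

with the last equality indicating that the generalized Waring process is a non-homogeneous Markov process. Its mean and variance are respectively

$$E\{N(t)\} = \frac{akt}{\rho - 1} \quad \text{and} \quad \text{Var}\{N(t)\} = \frac{akt\,(\rho + kt - 1)(\rho + a - 1)}{(\rho - 1)^{2}(\rho - 2)}.$$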
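The transition probabilities given above must be consistent with the marginal distributions of the process, in the sense that summing $P\{N(s) = m\}\,p_{m,n}(s, s + t)$ over m = 0, …, n recovers $P\{N(s + t) = n\}$. The sketch below checks this numerically; the function names are hypothetical, and the parameter values are those reported for the fitted process in Figure 1, used here purely for illustration.

```python
import math

def log_rising(h, l):
    """log of the ascending factorial h_(l) = Gamma(h + l) / Gamma(h)."""
    return math.lgamma(h + l) - math.lgamma(h)

def gwp_pmf(n, t, a, k, rho):
    """P{N(t) = n} for the gWp(a, k; rho), t > 0."""
    kt = k * t
    return math.exp(log_rising(rho, kt) - log_rising(rho + a, kt)
                    + log_rising(a, n) + log_rising(kt, n)
                    - log_rising(a + rho + kt, n) - math.lgamma(n + 1))

def gwp_transition(m, n, s, t, a, k, rho):
    """p_{m,n}(s, s + t) = P{N(s + t) = n | N(s) = m}, t > 0."""
    if n < m:
        return 0.0
    kt = k * t
    return math.exp(math.lgamma(a + n) - math.lgamma(a + m)
                    + log_rising(kt, n - m) - math.lgamma(n - m + 1)
                    + log_rising(rho + k * s, a + m)
                    - log_rising(rho + k * s + kt, a + n))

a, k, rho, s, t, n = 3.87, 0.83, 4.21, 2.0, 3.0, 4
via_transitions = sum(gwp_pmf(m, s, a, k, rho) * gwp_transition(m, n, s, t, a, k, rho)
                      for m in range(n + 1))
print(via_transitions, gwp_pmf(n, s + t, a, k, rho))
```

The dependence of `gwp_transition` on s as well as on t is what makes the process non-homogeneous.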

Note that since the generalized Waring process has stationary increments and its mean is of the form E[N(t)] = ηt, the above formula implies that its intensity is η = ak/(ρ − 1). Its variance can be split into three additive components, thus

$$\text{Var}\{N(t)\} = \sigma_{\Lambda_t}^{2} + (kt)^{2} \sigma_\nu^{2} + \sigma_R^{2},$$

with the liability and random components dependent on time. In particular,

$$\sigma_{\Lambda_t}^{2} = akt\,(a + 1)(\rho - 1)^{-1}(\rho - 2)^{-1}; \qquad \sigma_\nu^{2} = a(a + \rho - 1)(\rho - 1)^{-2}(\rho - 2)^{-1}; \qquad \sigma_R^{2} = akt\,(\rho - 1)^{-1}.$$
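The mean, the variance and its three components at a given time t can be checked against moments computed directly from the probabilities $P\{N(t) = n\}$, generated here by the ratio of successive terms. The function names, the truncation point `n_max` and the parameter values in this sketch are illustrative assumptions; ρ must exceed 2 for the variance to exist.

```python
import math

def gwp_moments(t, a, k, rho, n_max=20000):
    """Mean and variance of N(t), summed numerically from the probability function."""
    kt = k * t
    p = math.exp(math.lgamma(rho + kt) - math.lgamma(rho)
                 + math.lgamma(a + rho) - math.lgamma(a + rho + kt))  # P{N(t) = 0}
    mean = second = 0.0
    for n in range(n_max):
        mean += n * p
        second += n * n * p
        p *= (a + n) * (kt + n) / ((a + rho + kt + n) * (n + 1))     # P(n + 1) / P(n)
    return mean, second - mean ** 2

def gwp_variance_components(t, a, k, rho):
    """Liability, proneness (scaled by (kt)^2) and random components of Var{N(t)}."""
    kt = k * t
    liability = a * kt * (a + 1) / ((rho - 1) * (rho - 2))
    proneness = kt ** 2 * a * (a + rho - 1) / ((rho - 1) ** 2 * (rho - 2))
    randomness = a * kt / (rho - 1)
    return liability, proneness, randomness

t, a, k, rho = 2.0, 2.0, 1.5, 6.0
mean, var = gwp_moments(t, a, k, rho)
print(mean, a * k * t / (rho - 1))
print(var, sum(gwp_variance_components(t, a, k, rho)))
```

Only the proneness component $\sigma_\nu^2$ itself is free of t; its contribution to the total variance grows with $(kt)^2$, while the liability and random components grow linearly in t.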

9.1 The generalized Waring process in an accident proneness context

We consider a population which is inhomogeneous with respect to personal and environmental attributes affecting the occurrence of accidents. The terms “accident proneness” and “accident liability” are again used to refer respectively to a person’s predisposition to accidents, and to a person’s exposure to external risk of accident, with the conditional distribution of the random variable λ given ν describing differences in external risk factors among individuals. Liability fluctuations over a time interval (t, t + h) depend on the length h of the interval and are described by a distribution for λ|ν with probability density function $\lambda^{kh-1} e^{-\lambda/(\nu h)} (\nu h)^{-kh}/\Gamma(kh)$. Allowing further the parameter ν to have a beta distribution of the second kind with parameters a and ρ and density function ϕ given by

$$\phi(\nu) = \frac{\Gamma(a + \rho)\,\nu^{a-1}(1 + \nu)^{-(a+\rho)}}{\Gamma(a)\Gamma(\rho)}, \qquad a, \rho > 0,$$

we obtain for the distribution of the number of accidents N(t):

$$P\{N(t + h) - N(t) = n\} = \frac{\rho_{(kh)}}{(a + \rho)_{(kh)}}\, \frac{a_{(n)}\, (kh)_{(n)}}{(a + \rho + kh)_{(n)}}\, \frac{1}{n!}$$

and

$$P\{N(t) = n\} = P_n(t) = \frac{\rho_{(kt)}}{(a + \rho)_{(kt)}}\, \frac{a_{(n)}\, (kt)_{(n)}}{(a + \rho + kt)_{(n)}}\, \frac{1}{n!}, \qquad n = 0, 1, \ldots$$

So, the process arising in the context of this model satisfies the defining conditions of the generalized Waring process.

9.2 The generalized Waring process in the context of a spells model

Xekalaki and Zografi ([2008]) showed that the generalized Waring process could also be used in modeling temporally evolving data in the context of a spells model. Assume again that each person is liable to spells and that no accidents can occur outside spells. Let S(t), t = 0, 1, 2, …, the number of spells up to a given moment t, be a homogeneous Poisson process with rate k/m, k > 0; let the number $X_i$ of accidents within a spell i be a random variable with a logarithmic series distribution with parameters m and ν and probability function given by $P(X_i = n) = (m/n)\{\nu/(1 + \nu)\}^{n}$, n ≥ 1, with $P(X_i = 0) = 1 - m \log(1 + \nu)$, ν > 0, $0 < m < 1/\log(1 + \nu)$; and let the numbers of accidents arising out of different spells be independent and independent of the number S(t) of spells. Here, too, ν is regarded as the external risk parameter, which they assumed to vary according to a beta distribution of the second kind with parameters a and ρ and probability density function given by $\Gamma(a + \rho)\,\nu^{a-1}(1 + \nu)^{-(a+\rho)}/[\Gamma(a)\Gamma(\rho)]$, a, ρ > 0. They then showed that the above framework leads to a process conforming with the postulates of the generalized Waring process, thus demonstrating its potential application in the context of the spells model.

10. An application: modeling the counting process {N(s), s > 0} associated with the access pattern of a web site

As an illustration of the application potential of the generalized Waring process in other fields by appropriately adjusting the concepts and terminology used in this paper so as to have natural interpretations, we outline an example of a model for temporally evolving data on web access patterns provided by Xekalaki and Zografi ([2008]).

In this context, {N(s), s > 0} is the counting process associated with the access pattern of a web site, where, for any t > 0, N(t) represents the number of visits that the web pages on this particular site get within the interval (0, t). Note that the generalized Waring distribution was cited in Ajiferuke et al. ([2004]) as used by them to fit observed website visitation data for a given period, i.e., to model counts $N(t_0)$ of web visits on a given fixed time interval $(0, t_0)$.

Apart from chance, visits to a web site can be regarded as affected by the intrinsic appeal of the particular site to web users (corresponding to proneness) as well as by exogenous factors (corresponding to exposure to external risk, i.e. liability), such as links provided by other sites to the particular site, how well the site is advertised, etc.

Let ν denote the intrinsic factors and λ|ν the exogenous factors. Assuming that N(t)|λ follows a Poisson(λ(t)) distribution, where λ(t) = λt, with λ|ν following a gamma distribution with density $\lambda^{kt-1} e^{-\lambda/(\nu t)} (\nu t)^{-kt}/\Gamma(kt)$ and ν following a beta distribution of the second kind with density $\Gamma(a + \rho)\,\nu^{a-1}(1 + \nu)^{-(a+\rho)}/[\Gamma(a)\Gamma(\rho)]$, a, ρ > 0, the unconditional distribution of N(t) is the GWD(a, kt; ρ), i.e. the process {N(t), t ≥ 0} is a generalized Waring process.

10.1 The data

The log files representing the hits on an e-shop site for the period from March 31, 2006 to April 30, 2006 have been used to fit this model. (A log file typically contains information on the times of visits per IP address per day.) On the basis of such log files, the visits per day made by each of 468 IP addresses to a web site during the above period were enumerated, yielding 468 paths of visits $N_i(t_j)$ made by IP address i up to and including time $t_j$, denoted by $\{N_i(t_j),\ i = 1, 2, \ldots, 468;\ j = 1, 2, \ldots, 31\}$.

Moment estimates of the parameters of the generalized Waring process were obtained employing an estimation procedure for spatial point process data known in the literature as the centered reduced moment method. The method, introduced and studied by Ripley ([1976], [1977]), utilizes the intensity of the process and the mean number of further points within distance s of an arbitrary point of the process. In particular, the method utilizes the moment estimators

$$E\{\hat N(s)\} = \hat\mu_1 = \hat\eta s = ns/h, \qquad E\{\hat N^{2}(s)\} = \hat\mu_2 = X/n^{2}, \qquad E\{\hat N^{3}(s)\} = \hat\mu_3 = (Z - X)/n^{3},$$

with

$$X = \sum_{i=1}^{n} \sum_{j \ne i} \phi_s^{2}(x_i, x_j), \qquad Z = \sum_{i=1}^{n} \sum_{j \ne i} \phi_s(x_i, x_j) \sum_{k \ne i} \phi_s(x_i, x_k),$$

where the quantities involved in the above equations represent weights defined, for each value $x_i$ in the collection of points $\{x_i : i = 1, 2, \ldots, n\}$ of the process within a time interval of length h, as follows. For each $x_i$ in $\{x_i : i = 1, 2, \ldots, n\}$ and a given s > 0, consider the interval of center $x_i$ and length s and assign to every point $x_j$, j ≠ i, in this interval the weight $\phi_s(x_i, x_j) = \omega(x_i, x_j)^{-1}$, where $\omega(x_i, x_j)$ is the number of other points $\{x_k,\ k \ne i,\ k \ne j\}$ of the process that are included in the interval of length $|x_i - x_j|$ and center $x_i$ (see also Diggle and Chetwynd [1991]; Chetwynd and Diggle [1998], among others).

The standard errors of the parameter estimators thus obtained can in principle be determined by simulation, but the associated computations are formidable. Approximation formulas exist only for the case of the homogeneous planar Poisson process, while, for the class of stationary Cox processes, there is no obvious way to obtain estimable expressions, as noted by Chetwynd and Diggle ([1998]).

The observed paths were compared to the corresponding time series of simulated realizations of the generalized Waring process over the same time segment. For each IP address, 100 simulated realizations of the gWp(a, k; ρ) were obtained and each of the observed time series paths was compared to the corresponding simulated ones. On average, the realizations of the generalized Waring process exhibited structural characteristics recognizably similar to those of the paths of the observed time series. For illustration purposes, the path of the observed time series associated with one of the IP addresses considered is presented in Figure 1. In the graph, the path is overlaid with a sample of three of the 100 corresponding simulated realizations of the gWp(a, k; ρ). Inspection of the graph provides a visual appreciation of the degree of similarity in the structural characteristics of the observed path and the simulated realizations.

Figure 1 Observed and simulated paths of the gWp(3.87, 0.83; 4.21) corresponding to the selected IP address (Xekalaki and Zografi [2008]).
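One way in which realizations of a gWp(a, k; ρ) of the kind shown in Figure 1 could be generated is through the mixture representation of Section 9.1: draw ν once per path from the beta distribution of the second kind and then, given ν, draw independent negative binomial increments as gamma mixed Poisson counts. The sketch below follows this route. It is not claimed to be the simulation scheme used by Xekalaki and Zografi ([2008]); the function names, the daily step and the seed are assumptions, and the parameter values are those quoted in the caption of Figure 1.

```python
import random

def poisson_variate(mean, rng):
    """Poisson draw: number of unit-rate exponential arrivals falling in (0, mean]."""
    count, arrival = 0, rng.expovariate(1.0)
    while arrival <= mean:
        count += 1
        arrival += rng.expovariate(1.0)
    return count

def simulate_gwp_path(a, k, rho, n_steps, step=1.0, seed=None):
    """Cumulative counts N(step), N(2 step), ..., N(n_steps * step) of one realization."""
    rng = random.Random(seed)
    b = rng.betavariate(a, rho)
    nu = b / (1.0 - b)                          # one beta (second kind) draw per path
    path, total = [], 0
    for _ in range(n_steps):
        lam = rng.gammavariate(k * step, nu)    # gamma rate for this step (shape k*step, scale nu)
        total += poisson_variate(lam, rng)      # negative binomial increment given nu
        path.append(total)
    return path

# Three simulated 31-day paths with the parameter values quoted for Figure 1.
for seed in (1, 2, 3):
    print(simulate_gwp_path(3.87, 0.83, 4.21, n_steps=31, seed=seed))
```

Given ν the increments are independent, so N(t) is negative binomial with index kt conditionally on ν and generalized Waring with parameters a, kt; ρ unconditionally.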

Following Lewis ([1972]), Brillinger ([1978]) and Andersen et al. ([1993]), the closeness of the observed and realized time series was also checked using diagnostic plots based on the inverse-intensity residuals computed for each value $x_j$ in the collection of points $\{x_j : j = 1, 2, \ldots, n\}$ of the process, given by

$$R_{\hat\theta}(B_j, \eta^{-1}) = \sum_{x_i \in B_j} \hat\eta(x_i)^{-1} - \int_{B_j} I_{\mathbb{R}^{+}}\{\hat\eta(x)\}\, dx,$$

where $B_j = (0, x_j)$, $\hat\theta = (\hat a, \hat k, \hat\rho)$, $\hat\eta(x) = \eta(x, \hat\theta)$ is the fitted intensity and $I_{\mathbb{R}^{+}}$ is the indicator function. These plots exhibit similar results. The plot corresponding to the data associated with the IP address considered is shown in Figure 2.

Figure 2 Plot of inverse-intensity residuals corresponding to the selected IP address (Xekalaki and Zografi [2008]).