1 Introduction

In practice, it is natural to audit only a sample of accounting records to establish the correctness of the entire financial reporting process. Audit techniques are divided into two main areas: the so-called “internal audit”, which is carried out internally to monitor the accounting process, and the “external audit”, carried out by accounting experts who certify the correctness of the accounting recording process. We shall focus on the latter. In general, auditing aims to verify whether there are material errors in a set of N accounting records or items. The inferential problem facing the auditor is to decide, on the basis of sample information, whether the errors found in the accounting records are attributable only to random material errors or to fraudulent actions. Each item in the sample provides the auditor with two types of information: the recorded amount (or book amount) and the audited amount (or corrected amount). The difference between these two amounts is called the error, and it is used to estimate the overall unknown error amount.

Auditors want to verify whether the total error falls below a pre-assigned “tolerable error amount”, denoted \({{\mathcal {A}}}\) hereafter. This can be achieved by calculating an upper confidence bound for the total error. If this bound is lower than \({{\mathcal {A}}}\), the auditor concludes that no material misstatement has been made. If, on the contrary, this bound is larger than \({{\mathcal {A}}}\), the auditor may decide to verify all the recorded amounts. Alternatively, a p-value calculated at \({{\mathcal {A}}}\) can be used instead.

The primary focus is on the upper bound of the confidence interval rather than on point estimation. The fraction of incorrect items present in a sample can also be highly variable, leading to unreliable estimates and confidence bounds. Two scenarios are common: a relatively large number of small errors, giving a high overall error rate, or a small number of larger errors, giving a low overall error rate. Thus, the distribution of errors may be very skewed, with many null errors. As a consequence, upper confidence limits based upon variance estimates and the central limit theorem are no longer adequate (e.g. Cox and Snell 1979). The actual coverage of these intervals is frequently lower than the chosen nominal level (Kaplan 1973; Neter and Loebbecke 1975; Beck 1980). In practice, auditors tend to use unconventional confidence interval limits (e.g. Horgan 1996), such as Stringer’s (1963) bound. This approach, however, tends to give conservative limits, with coverages larger than the nominal level.

Audit samples are often selected with “probability proportional to size” sampling without replacement, also called “monetary-unit sampling” (MUS) (e.g. Arens and Loebbecke 1981; Higgins and Nandram 2009). Large-valued items, which carry the greatest potential for large overstatements, have a higher chance of being sampled.

Chen et al. (2003) proposed an empirical likelihood bound for populations containing many zero values. This approach is limited to simple random sampling and cannot be used directly with MUS. An analogous parametric likelihood-based approach based on mixture models was proposed by Kvanli et al. (1998). However, non-parametric approaches are preferable because they avoid making assumptions about the distribution of the errors. We propose to use Berger and Torres’s (2016) non-parametric weighted empirical likelihood approach, which takes into account the unequal selection probabilities inherent in MUS. Empirical likelihood provides confidence bounds driven by the distribution of the data (Owen 2001); that is, it tends to give large upper bounds with skewed data. This makes it particularly suitable for MUS. The bootstrap is another well-known non-parametric approach for confidence bounds. However, it may perform poorly with data containing many zero errors. In this paper, we compare numerically the empirical likelihood bound based on Berger and Torres (2016) with the Stringer (1963) bound.

This paper is organized as follows. Section 2 describes MUS and the point estimator of the total overstatement error. In Sect. 3, we describe the proposed empirical likelihood bound, the Stringer (1963) bound and other alternative bounds. The results of the simulation study are presented in Sect. 4.

2 Statistical sampling method in auditing

An accounting population consists of N line items with recorded (or book) values, \(\{z_i: i=1,\ldots ,N\}\), where \(z_i>0\). The audited (correct) amount of the N line items in the population is denoted by \(\{x_i: i=1,\ldots ,N\}\). The values \(x_i\) are unknown before sampling, whereas \(z_i\) are known.

The error in item i is \(y_i:=z_i- x_i\). When \(y_i>0\), the i-th item is overstated, and when \(y_i<0\), it is understated. We have 100% overstatement when \(y_i=z_i\). When \(y_i=0\), the item is error free. A large fraction of the items in the population are error free, while the non-zero errors are usually highly skewed to the right (Johnson et al. 1981; Neter et al. 1985). The total error amount is defined as

$$\begin{aligned} {Y}_{\scriptscriptstyle { N }}:=\sum _{i=1}^{N}y_i= \sum _{i=1}^{N}t_i\,z_i, \end{aligned}$$
(1)

where

$$\begin{aligned} t_i:=\frac{y_i}{z_i} \end{aligned}$$

is called the fractional error or “taint”; it is the fraction of error within \(z_i\).

The purpose is to estimate, on a sample basis, the total error amount \({Y}_{\scriptscriptstyle { N }}\). More precisely, the auditor is mostly interested in obtaining an upper bound of a confidence interval derived from an estimate of \({Y}_{\scriptscriptstyle { N }}\). If the upper bound exceeds a “tolerable error amount” \({{\mathcal {A}}}\), we conclude that there are significant material errors in the book values; otherwise, we conclude that there are only minor errors.

Generally, the audit process consists of selecting samples with “monetary-unit sampling” (MUS), also called “dollar unit sampling” (Arens and Loebbecke 1981). According to this approach, an accounting balance can be considered as a group of monetary units that can be either correct or incorrect. If the selected monetary unit falls within the i-th item, then a taint is observed. In practice, a systematic random sample S of size n with unequal probabilities proportional to \(z_i\) is often selected (e.g. Madow 1949; Tillé 2006, Section 7.2). However, the approach proposed is not limited to systematic sampling.

Under MUS, an audit amount \(x_i\) is selected with probability \({\pi }_{i}:=nz_i{Z}_{\scriptscriptstyle { N }}^{\,-1}\); where

$$\begin{aligned} {Z}_{\scriptscriptstyle { N }}:=\sum _{i=1}^{N}z_i\end{aligned}$$

denotes the known total book amount. We usually have \(nz_i{Z}_{\scriptscriptstyle { N }}^{\,-1}<1\). However, with small populations or strongly right-skewed \(z_i\), we may have \(nz_i{Z}_{\scriptscriptstyle { N }}^{\,-1}>1\) for some units. In this case, we need to adjust the \({\pi }_{i}\) with the usual scaling method that can be found in Tillé (2006, Sect. 2.10); that is, we set \({\pi }_{i}=1\) if \(nz_i{Z}_{\scriptscriptstyle { N }}^{\,-1}>1\) and adjust the remaining \({\pi }_{i}\) so that \(\sum _{i=1}^{N}{\pi }_{i}=n\). The Horvitz and Thompson (1952) estimator of \({Y}_{\scriptscriptstyle { N }}\) is given by

$$\begin{aligned} {{\widehat{Y}}}_{n}:=\sum _{i\in S}{w}_{i}\,y_i, \end{aligned}$$
(2)

with \({w}_{i}:={\pi }_{i}^{-1}\). If \(nz_i{Z}_{\scriptscriptstyle { N }}^{\,-1}\leqslant 1\) for all i, scaling is not needed and (2) reduces to the mean-per-unit form \({{\widehat{Y}}}_{n}={Z}_{\scriptscriptstyle { N }}\,{\bar{\,t\,}}\), where \({\bar{\,t\,}}:=n^{-1}\sum _{i\in S}t_i\) is the sample mean of the taints.
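For illustration, the following Python sketch implements the scaled inclusion probabilities and the estimator (2). The function names, the iterative rescaling loop and the numerical tolerance are ours; they are only one possible implementation of the scaling described in Tillé (2006, Sect. 2.10).

```python
import numpy as np

def mus_inclusion_probs(z, n, tol=1e-12):
    # pi_i proportional to the book values z_i; units with n z_i / Z_N >= 1
    # are given pi_i = 1 and the remaining probabilities are rescaled so
    # that the pi_i still sum to n.
    z = np.asarray(z, dtype=float)
    pi = np.zeros(len(z))
    take_all = np.zeros(len(z), dtype=bool)
    while True:
        rest = ~take_all
        pi[rest] = (n - take_all.sum()) * z[rest] / z[rest].sum()
        pi[take_all] = 1.0
        new = rest & (pi >= 1.0 - tol)
        if not new.any():
            return pi
        take_all |= new

def horvitz_thompson(y_s, pi_s):
    # Horvitz-Thompson estimator (2) of the total error, with w_i = 1/pi_i.
    return np.sum(np.asarray(y_s) / np.asarray(pi_s))
```

When no unit reaches \({\pi }_{i}=1\), the first pass simply returns \({\pi }_{i}=nz_i{Z}_{\scriptscriptstyle { N }}^{\,-1}\) and the estimator reduces to \({Z}_{\scriptscriptstyle { N }}\,{\bar{\,t\,}}\), as noted above.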

3 Confidence bound for the total error amount

The interest of the auditors usually focuses on obtaining an upper confidence bound for \({Y}_{\scriptscriptstyle { N }}\), at a specified confidence level \(1-\alpha \in [0.5,1)\), e.g. \(\alpha =0.05\) or 0.01. If this upper bound exceeds a tolerable error amount \({{\mathcal {A}}}\), then there is statistical evidence of a possible material error. When this bound is less than \({{\mathcal {A}}}\), we conclude that the recorded values are a fair reflection of the accounts.

It is important to compute confidence intervals whose limits are reliable. The presence of low error rates means that the \(y_i\) usually have a strongly right-skewed distribution, because small \(y_i\) are much more frequent than large \(y_i\). As a result, the upper limits of the naïve confidence intervals based on variance estimation and the central limit theorem [see (13) below] can be problematic, as their coverage is generally below the confidence level \(1-\alpha \) (Kaplan 1973; Neter and Loebbecke 1975; Beck 1980), because the sampling distribution is usually not normal (Stringer 1963; Kaplan 1973; Neter and Loebbecke 1975, 1977). In addition, a negative correlation between \({{\widehat{Y}}}_{n}\) and standard error estimates can increase the probability of a type II error and reduce the probability of a type I error (Kaplan 1973). The lack of normality is the main reason for not using classical statistical inference based on the central limit theorem (Ramage et al. 1979; Johnson et al. 1981; Neter et al. 1985; Ham et al. 1985).

Non-traditional heuristic estimation methods have been developed to overcome the above problems (e.g. Horgan 1996). These methods are known as “Combined Attribute and Variable” (CAV) methods (Goodfellow et al. 1974a, b), some of which are described in Sects. 3.3 and 3.4. The Stringer (1963) bound, described in Sect. 3.3, is widely used by auditors. Swinamer et al.’s (2004) simulation study shows that this upper bound is too conservative, with coverage frequently greater than \(1-\alpha \).

3.1 Weighted empirical likelihood bounds proposed for MUS

Berger and Torres (2016) developed an empirical likelihood approach for unequal probability sampling. We show how this method can be used to derive a confidence bound for the total error \({Y}_{\scriptscriptstyle { N }}\).

The “maximum empirical likelihood estimator” is defined by

$$\begin{aligned} {{\widehat{Y}}}_{EL}=\arg \max \ell ({Y}), \end{aligned}$$

where \(\ell ({Y})\) is the following “weighted empirical likelihood function”:

$$\begin{aligned} \ell ({Y}):= \max _{{p}_{i}: i\in S} \biggl \{ \sum _{i\in S}\log (n{p}_{i}) : {p}_{i}> 0, \sum _{i\in S}{p}_{i}= 1, \sum _{i\in S}{p}_{i}{w}_{i}\Bigl ( y_i- \frac{{Y}{\pi }_{i}}{n} \Bigr ) = 0 \biggr \} , \end{aligned}$$
(3)

where \({w}_{i}:={\pi }_{i}^{-1}\) are the design weights. The function (3) differs from Owen’s (1988) and Chen et al.’s (2003) empirical likelihood functions, because the constraint within (3) contains the adjustments \({w}_{i}\), which take into account the fact that the \(y_i\) are selected with unequal probabilities under MUS. The function (3) can also be adjusted to accommodate stratification (see Berger and Torres 2016, for more details).

Using Lagrangian multipliers, the set of \({p}_{i}\) that maximises \(\sum _{i\in S}\log (n{p}_{i})\) for a given \({Y}\) is given by

$$\begin{aligned} {p}_{i}({Y}):=\frac{1}{n} \Bigl \{ 1 + {w}_{i}{{\varvec{c}}}_i({Y})^{\top }{\varvec{\eta }}\Bigr \}^{-1} , \end{aligned}$$

where \({{\varvec{c}}}_i({Y})\) is the \(2\times 1\) vector function

$$\begin{aligned} {{\varvec{c}}}_i({Y}):=\Bigl \{ {\pi }_{i}, \Bigl ( y_i- \frac{{Y}{\pi }_{i}}{n} \Bigr ) \Bigr \}^{\top }\end{aligned}$$
(4)

and \({\varvec{\eta }}\) is the Lagrangian vector which is such that the constraint

$$\begin{aligned} n\sum _{i\in S}{p}_{i}({Y})\,{w}_{i}\,{{\varvec{c}}}_i({Y})= (n,0)^{\top }\end{aligned}$$
(5)

holds. Thus, (3) reduces to

$$\begin{aligned} \ell ({Y})= \sum _{i\in S}\log \bigl \{ n{p}_{i}({Y})\bigr \} = -\sum _{i\in S}\log \bigl \{ 1 + {w}_{i}{{\varvec{c}}}_i({Y})^{\top }{\varvec{\eta }}\bigr \}\cdot \end{aligned}$$
(6)

This function can be calculated numerically from the observed \(y_i\) (\(i\in S\)) and a given value \({Y}\).
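As an illustration, a minimal Python sketch of this calculation is given below, using a generic root finder for the Lagrangian vector \({\varvec{\eta }}\). The function name and the solver choice are ours, and a careful implementation would also enforce the positivity of the \({p}_{i}\).

```python
import numpy as np
from scipy.optimize import root

def log_el(Y, y, pi):
    # Weighted empirical log-likelihood ell(Y) of (3), computed via (4)-(6).
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    n = len(y)
    w = 1.0 / pi                                   # design weights w_i
    c = np.column_stack([pi, y - Y * pi / n])      # c_i(Y), equation (4)

    def constraint(eta):
        # residual of equation (5): n * sum_i p_i(Y) w_i c_i(Y) - (n, 0)
        p = 1.0 / (n * (1.0 + w * (c @ eta)))      # p_i(Y)
        return n * (c.T @ (p * w)) - np.array([n, 0.0])

    eta = root(constraint, x0=np.zeros(2)).x       # Lagrangian vector eta
    return -np.sum(np.log1p(w * (c @ eta)))        # equation (6)
```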

In practice, the function (3) is not needed for point estimation, because it can be shown that \({{\widehat{Y}}}_{EL}={{\widehat{Y}}}_{n}\) given by (2). This function is used to derive an upper confidence bound. The \((1-\alpha )\) “empirical likelihood confidence bound” \({b}_{\alpha }\) is the largest root of

$$\begin{aligned} -2\ell ({b}_{\alpha })= \chi ^2_{1, 1-2\alpha } , \end{aligned}$$
(7)

where \(\chi ^2_{1, 1-2\alpha }\) denotes the \((1-2\alpha )\) quantile of a \(\chi ^2\)-distribution with one degree of freedom. A root-finding algorithm, such as Brent’s (1973) and Dekker’s (1969) method, can be used to find \({b}_{\alpha }\).
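Building on the `log_el` sketch above, the bound can be obtained by bracketing the largest root to the right of the point estimate, where \(-2\ell \) is increasing. This is an illustrative sketch rather than the implementation used in the paper.

```python
import numpy as np
from scipy.stats import chi2
from scipy.optimize import brentq

def el_upper_bound(y, pi, alpha=0.05):
    # Largest root b_alpha of -2*ell(b) = chi^2_{1, 1-2*alpha}, equation (7).
    Y_hat = np.sum(np.asarray(y) / np.asarray(pi))   # point estimate (2)
    crit = chi2.ppf(1.0 - 2.0 * alpha, df=1)
    g = lambda Y: -2.0 * log_el(Y, y, pi) - crit     # vanishes at b_L and b_alpha

    step = max(abs(Y_hat), 1.0)
    for _ in range(60):                              # expand until the sign changes
        if g(Y_hat + step) > 0.0:
            break
        step *= 2.0
    return brentq(g, Y_hat, Y_hat + step)            # Brent-Dekker root search
```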

The quantity \({b}_{\alpha }\) is an upper confidence bound because the convexity of \(-2\ell ({Y})\) implies that the equation \(-2\ell ({Y})=\chi ^2_{1, 1-2\alpha }\) has two roots \({b}_{\scriptscriptstyle { L }}\) and \({b}_{\alpha }\), such that \({b}_{\scriptscriptstyle { L }}<{b}_{\alpha }\). Berger and Torres (2016) showed that

$$\begin{aligned} -2\ell ({Y}_{\scriptscriptstyle { N }})\overset{d}{\rightarrow }\chi ^2_{1} , \end{aligned}$$
(8)

where \(\chi ^2_{1}\) denotes the \(\chi ^2\)-distribution with one degree of freedom. Hence \(Pr\{-2\ell ({Y}_{\scriptscriptstyle { N }})\leqslant \chi ^2_{1, 1-2\alpha }\}\rightarrow 1-2\alpha \), or \(Pr\{{b}_{\scriptscriptstyle { L }}\leqslant {Y}_{\scriptscriptstyle { N }}\leqslant {b}_{\alpha }\}\rightarrow 1-2\alpha \) by the convexity of \(-2\ell ({Y})\). Thus, \([{b}_{\scriptscriptstyle { L }},{b}_{\alpha }]\) is a two-sided \((1-2\alpha )\) confidence interval, and \({b}_{\alpha }\) is indeed an upper confidence bound.

The computation of \({b}_{\alpha }\) involves a root-finding algorithm. A simpler and less computationally intensive approach, based on the “p-value” of a one-sided test, can be used to check whether \({b}_{\alpha }\leqslant {{\mathcal {A}}}\) at a given level \(\alpha \), where \({{\mathcal {A}}}\) denotes the “tolerable error amount”. A p-value less than \(\alpha \) means that \({b}_{\alpha }\) is below \({{\mathcal {A}}}\). In other words, \({b}_{\alpha }\leqslant {{\mathcal {A}}}\) if p-value \(\leqslant \alpha \), and \({b}_{\alpha }>{{\mathcal {A}}}\) otherwise. This p-value is given by

$$\begin{aligned} \text{ p-value } :=\frac{1}{2}\Bigl [ 1+ (-1)^{\delta \{{{\widehat{Y}}}_{n}\leqslant {{\mathcal {A}}}\}} F\{-2\ell ({{\mathcal {A}}})\} \Bigr ]. \end{aligned}$$
(9)

This is the p-value of a one-sided test. Here, \(F\{\cdot \}\) is the cumulative distribution function of a \(\chi ^2\)-distribution with one degree of freedom, and \(\delta \{{{\widehat{Y}}}_{n}\leqslant {{\mathcal {A}}}\}=1\) if \({{\widehat{Y}}}_{n}\leqslant {{\mathcal {A}}}\) and \(\delta \{{{\widehat{Y}}}_{n}\leqslant {{\mathcal {A}}}\}=0\) otherwise. The value of \(F\{-2\ell ({{\mathcal {A}}})\}\) can be found from the usual statistical tables of the \(\chi ^2\)-distribution. Note that \({{\mathcal {A}}}<{{\widehat{Y}}}_{n}\) implies p-value \(\geqslant 0.5\), because \(F\{-2\ell ({{\mathcal {A}}})\}\geqslant 0\). It can be shown that p-value \(\leqslant \alpha \) implies \(-2\ell ({{\mathcal {A}}})\geqslant \chi ^2_{1, 1-2\alpha }\) and \({{\widehat{Y}}}_{n}\leqslant {{\mathcal {A}}}\). The strict concavity of (6) then implies that \({{\mathcal {A}}}\) is at least the largest root \({b}_{\alpha }\) of (7); hence \({b}_{\alpha }\leqslant {{\mathcal {A}}}\). In the trivial case \({{\mathcal {A}}}<{{\widehat{Y}}}_{n}\), we always have \({b}_{\alpha }>{{\mathcal {A}}}\) and p-value \(\geqslant 0.5\).
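A sketch of this p-value calculation, again assuming the `log_el` helper defined earlier (the helper and its name are ours):

```python
import numpy as np
from scipy.stats import chi2

def el_p_value(A, y, pi):
    # One-sided p-value (9) at the tolerable error amount A.
    Y_hat = np.sum(np.asarray(y) / np.asarray(pi))
    F = chi2.cdf(-2.0 * log_el(A, y, pi), df=1)
    sign = -1.0 if Y_hat <= A else 1.0              # (-1)^{delta{Y_hat <= A}}
    return 0.5 * (1.0 + sign * F)
```

Only a single evaluation of \(\ell ({{\mathcal {A}}})\) is required, which is why this route is cheaper than solving (7).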

Berger and Torres (2016) showed that (8) holds conditionally on \(\{y_i, {\pi }_{i}: i=1,\ldots ,N\}\). Property (8) relies on regularity conditions, such as the existence of fourth moments of \({{\widehat{Y}}}_{n}\), \(n/N\rightarrow 0\) and the central limit theorem holding for \({{\widehat{Y}}}_{n}\). In fact, \({{\widehat{Y}}}_{n}\) may not be normally distributed, because of the skewness of the distribution of \(y_i\). Nevertheless, simulation studies have shown that, for moderate n, the distribution of \(-2\ell ({Y}_{\scriptscriptstyle { N }})\) is still well approximated by a \(\chi ^2\)-distribution, even with skewed \(y_i\) (Owen 1988; Berger and Torres 2016). Since empirical likelihood is a data-driven approach, the bound \({b}_{\alpha }\) should capture the skewness of the \(y_i\).

3.2 Extension for large sampling fraction or strong correlation between selection probabilities and the errors

The approach described so far relies on \(n/N\rightarrow 0\), because (8) holds under this assumption. Berger and Torres (2016) proposed an empirical likelihood for non-negligible n/N or when the \({\pi }_{i}\) are strongly correlated with the \(y_i\). We briefly describe this approach; the technical details can be found in Berger and Torres (2016) and Berger (2018). This approach is based on a “penalised empirical likelihood function” defined by

$$\begin{aligned} {\widetilde{\ell }}({Y}):= & {} \max _{{p}_{i}: i\in S} \biggl \{ \sum _{i\in S}\log (n{p}_{i}) - n\sum _{i\in S}{p}_{i}+n : {p}_{i}> 0,\\&\quad \sum _{i\in S}\Bigl ( n{p}_{i}{q}_{i}-\frac{{q}_{i}-1}{n} \Bigr ) = 1,\ n\sum _{i\in S}\Bigl ( {p}_{i}{q}_{i}-\frac{{q}_{i}-1}{n} \Bigr ) {w}_{i}\Bigl ( y_i- \frac{{Y}{\pi }_{i}}{n} \Bigr ) = 0 \biggr \} , \end{aligned}$$

where \({p}_{i}=n^{-1}\) if \({q}_{i}=0\). Here, \({q}_{i}=(1-{\pi }_{i})^{1/2}\) are Hájek’s (1964) finite population corrections. Note that \(n/N\rightarrow 0\) implies \({\pi }_{i}\rightarrow 0\) and \({q}_{i}\rightarrow 1\), and if we replace \({q}_{i}\) by 1, then \({\widetilde{\ell }}(Y)\) reduces to (3). Berger and Torres (2016) showed that \(-2{\widetilde{\ell }}({Y}_{\scriptscriptstyle { N }})\overset{d}{\rightarrow }\chi ^2_{1}\) for non-negligible n/N. Thus, the \((1-\alpha )\) “penalised empirical likelihood confidence bound” \({\widetilde{b}}_{\alpha }\) is the largest root of

$$\begin{aligned} -2{\widetilde{\ell }}({\widetilde{b}}_{\alpha })= \chi ^2_{1, 1-2\alpha } \cdot \end{aligned}$$
(10)

The function \({\widetilde{\ell }}({Y})\) can be calculated by using the Lagrangian method, as in (6). We expect \({\widetilde{b}}_{\alpha }\) to be smaller than \({b}_{\alpha }\) when n/N is non-negligible. The p-value at the tolerable amount \({{\mathcal {A}}}\) is p-value \(:=0.5[1+(-1)^{\delta \{{{\widehat{Y}}}_{n}\leqslant {{\mathcal {A}}}\}}F\{-2{\widetilde{\ell }}({{\mathcal {A}}})\}]\).
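The closed-form Lagrangian solution for \({\widetilde{\ell }}({Y})\) is given in Berger and Torres (2016); as a rough numerical check, one can instead maximise the penalised objective directly under its two constraints, as in the sketch below. This uses a generic constrained optimiser, not the authors’ algorithm, and is practical only for moderate n.

```python
import numpy as np
from scipy.optimize import minimize

def log_el_penalised(Y, y, pi):
    # Penalised empirical log-likelihood ell~(Y), by direct constrained
    # maximisation; units with pi_i = 1 (q_i = 0) drop out of the constraints
    # and the objective then pushes their p_i towards 1/n, as required.
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    n = len(y)
    w, q = 1.0 / pi, np.sqrt(1.0 - pi)              # Hajek corrections q_i
    u = y - Y * pi / n

    def neg_obj(p):                                  # -(sum log(n p_i) - n sum p_i + n)
        return -(np.sum(np.log(n * p)) - n * np.sum(p) + n)

    cons = (
        {"type": "eq", "fun": lambda p: np.sum(n * p * q - (q - 1.0) / n) - 1.0},
        {"type": "eq", "fun": lambda p: n * np.sum((p * q - (q - 1.0) / n) * w * u)},
    )
    start = np.full(n, 1.0 / n)
    res = minimize(neg_obj, start, method="SLSQP", constraints=cons,
                   bounds=[(1e-10, None)] * n)
    return -res.fun
```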

Our simulation study in Sect. 4 also shows that \({b}_{\alpha }\) given by (7) can be too conservative when the \(y_i\) are strongly correlated with the \({\pi }_{i}\). This can be the case when the errors are mainly within the right tail of the \(z_i\). In these situations, \({\widetilde{b}}_{\alpha }\) is less conservative and has better coverage, even when n/N is negligible.

The bound \({\widetilde{b}}_{\alpha }\) should be lower than \({b}_{\alpha }\), when the number of units with \({\pi }_{i}=1\) is large, because the variance of the sampling distribution is smaller (see Berger and Torres 2016, for more details).

For accounting populations with very low error rates, we may have a “zero-error sample”; that is, \(y_i=0\) for all the sampled items. In this case, \({{\widehat{Y}}}_{n}=0\) and the auditor evaluates the book amounts as free of error. It is then not possible to obtain empirical likelihood bounds, because the functions \(\ell ({Y})\) and \({\widetilde{\ell }}({Y})\) cannot be computed when \(y_i=0\) for all \(i\in S\). More precisely, the bound \({\widetilde{b}}_{\alpha }\) cannot be computed when \({q}_{i}y_i=0\) for all \(i\in S\); that is, when \(y_i=0\) for all \(i\in S\) such that \({\pi }_{i}<1\). In this case, it may not be possible to find \({p}_{i}\) that satisfy the constraints within \({\widetilde{\ell }}({Y})\) for a given \({Y}\). The more conservative bound \({b}_{\alpha }\) can still be computed in this situation, as long as \(y_i\ne 0\) for some units with \({\pi }_{i}=1\).

3.3 The Stringer bound

Suppose that we are interested only in cases of overstatement; that is, \(x_i\leqslant z_i\), so that \(y_i\geqslant 0\). Let us also assume that the value of each overstatement does not exceed the declared value, such that \(0\leqslant t_i\leqslant 1\). Let \({T}_1, \ldots , {T}_n\) be independent random variables that describe the taints, such that \(Pr(0 \leqslant {T}_i\leqslant 1) =1 \). Here, \(t_i\) is an observation of \({T}_i\). Let \(0\leqslant t_{(1)}\leqslant t_{(2)}\leqslant \ldots \leqslant t_{(n)}\leqslant 1\) be the order statistics of \(\{{T}_1, \ldots , {T}_n\}\). Let \(u_i\) be the \((1-\alpha )\) upper confidence limit for the binomial parameter when i errors are observed in a sample of size n. The quantity \(u_i\) is the unique solution to

$$\begin{aligned} \sum ^{i}_{k=0} \left( {\begin{array}{c}n\\ k\end{array}}\right) u_i^k (1-u_i)^{n-k} = \alpha , \qquad \text{ for }\ i = 0,1,\ldots ,n-1 ; \end{aligned}$$
(11)

with \(u_n=1\). The \(u_i\) can also be calculated using a Poisson approximation to the binomial in (11). An upper bound for the total overstatement error is obtained by combining these upper limits with the observed taints. The Stringer bound at significance level \(\alpha \) is defined as (e.g. Pap and van Zuijlen 1995)

$$\begin{aligned} b_{\alpha }^{(\scriptscriptstyle { S })}:={Z}_{\scriptscriptstyle { N }}\Biggl \{ u_0+ \sum _{i=1}^{n} ( u_i- u_{i-1})t_{(n-i+1)}\Biggr \}\cdot \end{aligned}$$

This bound relies on \(0\leqslant t_i\leqslant 1\), which is not required for the empirical likelihood bounds \({b}_{\alpha }\) and \({\widetilde{b}}_{\alpha }\). When some taints \(t_i\) are negative, we can use the “Stringer offset bound” (e.g. Clayton and McMullen 2007) given by

$$\begin{aligned} b_{\alpha }^{(\scriptscriptstyle { S }\scriptscriptstyle { O })}:={Z}_{\scriptscriptstyle { N }}\Bigg \{ u_0+ \sum _{i=1}^{n} ( u_i- u_{i-1}) \max (0,t_{(n-i+1)}) + \frac{1}{n} \sum _{i=1}^{n} \min (t_i,0) \Bigg \}\cdot \end{aligned}$$

We have that \(b_{\alpha }^{(\scriptscriptstyle { S }\scriptscriptstyle { O })}=b_{\alpha }^{(\scriptscriptstyle { S })}\), when \(0\leqslant t_i\leqslant 1\).
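For illustration, both bounds can be computed as follows; the \(u_i\) solving (11) are Clopper–Pearson upper limits, written here as Beta quantiles (the function name and this particular parameterisation are ours).

```python
import numpy as np
from scipy.stats import beta

def stringer_bounds(t, Z_N, alpha=0.05):
    # Stringer bound and Stringer offset bound from the sampled taints t.
    t = np.asarray(t, float)
    n = len(t)
    k = np.arange(n)
    u = np.append(beta.ppf(1.0 - alpha, k + 1, n - k), 1.0)  # u_0,...,u_{n-1}, u_n = 1
    du = np.diff(u)                                          # u_i - u_{i-1}, i = 1,...,n
    t_desc = np.sort(t)[::-1]                                # t_{(n)}, ..., t_{(1)}
    b_s = Z_N * (u[0] + np.sum(du * t_desc))
    b_so = Z_N * (u[0] + np.sum(du * np.maximum(t_desc, 0.0))
                  + np.mean(np.minimum(t, 0.0)))
    return b_s, b_so
```

With a zero-error sample (all \(t_i=0\)), this sketch returns \({Z}_{\scriptscriptstyle { N }}(1-\alpha ^{1/n})\), the value discussed below.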

The Stringer bound has been extensively studied in the literature, and many empirical studies confirm that its coverage is at least equal to the nominal level. However, this bound is very conservative (Leitch et al. 1982; Reneau 1978; Anderson and Teitlebaum 1973; Wurst et al. 1989; Higgins and Nandram 2009) and is usually much larger than the total error (1). This is also confirmed by the simulation study in Sect. 4. The direct consequence is that auditors may reject an acceptable accounting population (Leitch et al. 1982). Bickel (1992) studied the asymptotic behaviour of the Stringer bound and showed that, for large samples, the confidence level is frequently higher than its nominal level. Pap and van Zuijlen (1996) showed that the Stringer bound is asymptotically conservative. In Sect. 4, we show that the Stringer bound is more conservative than the empirical likelihood bound. Indeed, \({b}_{\alpha }\) and \({\widetilde{b}}_{\alpha }\) are usually smaller than \(b_{\alpha }^{(\scriptscriptstyle { S })}\) and have coverages close to the nominal level \(1-\alpha \). The “Stringer offset bound” \(b_{\alpha }^{(\scriptscriptstyle { S }\scriptscriptstyle { O })}\) can be less conservative when some \(t_i\) are negative.

It is not possible to compute the empirical likelihood bounds (\({b}_{\alpha }\) or \({\widetilde{b}}_{\alpha }\)) with a zero-error sample. The Stringer approach, however, has the advantage of providing a bound in this situation. Indeed, when \(y_i=0\) for all the sampled items, \(t_i=0\), \({{\widehat{Y}}}_{n}=0\) and \(b_{\alpha }^{(\scriptscriptstyle { S })}={Z}_{\scriptscriptstyle { N }}u_0={Z}_{\scriptscriptstyle { N }}(1-\alpha ^{1/n})\). Since usually \({{\widehat{Y}}}_{n}={Z}_{\scriptscriptstyle { N }}\,{\bar{\,t\,}}\), we can view \((1-\alpha ^{1/n})\) as an upper bound for the average taint. This upper bound decreases with n, reflecting the fact that with a large zero-error sample the average taint has more chance of being small.

3.4 Other confidence bounds

Fienberg et al. (1977) introduced a less conservative bound based on a multinomial distribution derived from MUS. The method is rather complex, because it is necessary to maximise over a joint confidence region. Leslie et al. (1979) proposed a “cell bound”, which can be much greater than the actual error amount when the error rate is low (Plante et al. 1985). Dworin and Grimlund (1984, 1986) introduced the so-called “moment bound”, which is obtained by approximating the sampling distribution with a three-parameter gamma distribution; the method of moments is used to estimate these parameters. Simulation studies show that the moment bound gives coverage close to the nominal level and is less conservative than the Stringer bound.

Fishman (1991) showed that Hoeffding’s inequality can be used to derive a confidence bound, which can be more conservative than the Stringer bound. Howard (1994) proposed a bound based on bootstrap and Hoeffding’s inequality. This bound is not uniformly better than the Stringer (1963) bound, when the accounts are characterized by low error rates.

When the non-zero account values can be described by a suitable parametric model, Kvanli et al. (1998) showed that it is possible to use a parametric likelihood ratio statistic to define a two-sided confidence interval for the mean error. The nominal coverage is achieved when this parametric model holds. However, the bound depends entirely on the parametric model, and assuming a model that does not match the distribution of the errors may affect the coverage. The method introduced in Sect. 3.1 is very similar, but it has the advantage of being non-parametric, because it is not necessary to assume a model for the errors.

4 Simulation studies

In this section, we compare the numerical performance of the proposed empirical likelihood bounds with that of the Stringer bound. The recorded values \(z_i\) are simulated from a skewed log-normal distribution,

$$\begin{aligned} \log (Nz_i) \sim {{\mathcal {N}}}(1,\sigma ^2=1.44), \qquad i=1,\ldots ,N \cdot \end{aligned}$$
(12)

We use this distribution because the \(z_i\) are monetary values, which usually follow a right-skewed distribution. Furthermore, the main reason for using MUS is the skewness of the \(z_i\). The resulting \({\pi }_{i}\), proportional to \(z_i\), are right-skewed, with some \({\pi }_{i}=1\), depending on the values of N and n. The distribution of the taints \(t_i\) is crucial, because it drives the sampling distribution of \({{\widehat{Y}}}_{n}\) and the upper bounds. Indeed, when \({\pi }_{i}<1\), we have \({{\widehat{Y}}}_{n}={Z}_{\scriptscriptstyle { N }}\,{\bar{\,t\,}}\) and the sample mean \({\bar{\,t\,}}\) of the taints drives the sampling distribution of \({{\widehat{Y}}}_{n}\). We shall consider uniform and skewed distributions, with positive and negative \(t_i\), and large fractions of \(t_i=0\) and 1. In the different simulation setups considered, we vary n as well as the distribution of the \(t_i\). For a given N, the generated \(z_i\) are held fixed across all setups. This isolates the effect of the distribution of the taints \(t_i\).

Consider \(N=10{,}000,\ 1000\) and 700. The error in item i is \(y_i=t_i\,z_i\), where \((100-r)\%\) of the \(t_i\) are equal to zero and the remaining \(r\%\) positive taints are generated randomly from uniform distributions \(\text{ Un }(t_{L},t_{U})\). Here, \(r\) denotes the error rate. We shall consider \(r=2\%,\,5\%\) and \(10\%\). Several ranges of positive taints are considered: \([t_{L},t_{U}]=[0.1,0.3],\,[0.2,0.7]\) and [0.5, 0.7]. The generated values \(y_i\), \(t_i\), \(z_i\) are treated as fixed. The MUS sample is based on a systematic procedure with random ordering of the line items, selected with probability proportional to \(z_i\). The sample sizes considered are \(n=100,\ 200\) and 500. We use 1000 replications and a nominal coverage of \(1-\alpha =0.95\). The results are given in Tables 1 and 2.
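The following Python sketch reproduces our reading of this setup: book values from (12), a fraction \(r\) of uniform positive taints, and systematic sampling with random ordering. The random seed, the helper names and the treatment of boundary cases are ours.

```python
import numpy as np

rng = np.random.default_rng(2023)                      # arbitrary seed

def simulate_population(N, r, t_low, t_high, sigma2=1.44):
    # Book values z_i with log(N z_i) ~ N(1, 1.44), and taints t_i that are
    # zero for (1 - r)N items and Un(t_low, t_high) for the remaining r N items.
    z = np.exp(rng.normal(1.0, np.sqrt(sigma2), size=N)) / N
    t = np.zeros(N)
    err = rng.choice(N, size=int(round(r * N)), replace=False)
    t[err] = rng.uniform(t_low, t_high, size=err.size)
    return z, t * z                                    # z_i and y_i = t_i z_i

def systematic_pps(pi):
    # Systematic sampling with random ordering; pi must sum to n (after scaling).
    order = rng.permutation(len(pi))
    cum = np.cumsum(pi[order])
    points = rng.uniform() + np.arange(int(round(cum[-1])))
    return order[np.searchsorted(cum, points)]
```

For example, `z, y = simulate_population(10000, 0.05, 0.2, 0.7)` followed by `s = systematic_pps(mus_inclusion_probs(z, 200))` (reusing the earlier sketch from Sect. 2) gives one replication of the \(N=10{,}000\), \(r=5\%\) setup, on which the bounds of Sect. 3 can then be evaluated.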

We shall compare \(b_{\alpha }^{(\scriptscriptstyle { S })}\) with \({b}_{\alpha }\), \({\widetilde{b}}_{\alpha }\) and \(b_{\alpha }^{(\scriptscriptstyle {{{\mathcal {N}}}})}\), where \(b_{\alpha }^{(\scriptscriptstyle {{{\mathcal {N}}}})}\) is the following naïve bound based on the normal approximation:

$$\begin{aligned} b_{\alpha }^{(\scriptscriptstyle {{{\mathcal {N}}}})}:={{\widehat{Y}}}_{n}+ {\varPhi }^{- 1}(1-\alpha )\ {\widehat{v}}({{\widehat{Y}}}_{n})^{\frac{1}{2}} , \end{aligned}$$
(13)

where \(\varPhi (\cdot )\) is the cumulative distribution function of the standard normal distribution and \({\varPhi }^{- 1}(1-\alpha )\) is its \(1-\alpha \) quantile. Here, \({\widehat{v}}({{\widehat{Y}}}_{n})\) is Hartley and Rao’s (1962) consistent variance estimator for systematic sampling. It is well known that \(b_{\alpha }^{(\scriptscriptstyle {{{\mathcal {N}}}})}\) tends to be too small; it is used here as a benchmark.
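A sketch of (13) follows; for brevity it replaces the Hartley–Rao (1962) estimator by the common with-replacement approximation \({\widehat{v}}({{\widehat{Y}}}_{n})=\{n(n-1)\}^{-1}\sum _{i\in S}(n\,y_i/{\pi }_{i}-{{\widehat{Y}}}_{n})^2\), which ignores the finite population correction and is therefore only a rough stand-in.

```python
import numpy as np
from scipy.stats import norm

def naive_normal_bound(y_s, pi_s, alpha=0.05):
    # Naive upper bound (13), with a with-replacement PPS variance estimate.
    y_s, pi_s = np.asarray(y_s, float), np.asarray(pi_s, float)
    n = len(y_s)
    Y_hat = np.sum(y_s / pi_s)
    v_hat = np.sum((n * y_s / pi_s - Y_hat) ** 2) / (n * (n - 1))
    return Y_hat + norm.ppf(1.0 - alpha) * np.sqrt(v_hat)
```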

Several indicators are computed to assess the accuracy of the bounds. The coverage probability of a specific bound is the proportion of replications for which the bound is greater than or equal to the true population error amount. A bound is considered unreliable if its coverage is significantly different from \(1-\alpha =0.95\). The observed mean of a bound \(b\) is denoted by Mean\((b)\), with \(b=b_{\alpha }^{(\scriptscriptstyle {{{\mathcal {N}}}})},\,{b}_{\alpha },\,{\widetilde{b}}_{\alpha }\) or \(b_{\alpha }^{(\scriptscriptstyle { S })}\). In the tables, we report the value of Mean\((b)/{{Y}}_{0.95}\), where \({{Y}}_{0.95}\) denotes the 95% quantile of the observed distribution of \({{\widehat{Y}}}_{n}\). The quantities Mean\((b)/{{Y}}_{0.95}\) are unit-free values which are usually close to 1. The uncertainty of a bound is measured by its observed standard deviation (s.d.). In the tables, we report the relative efficiencies \(s.d.({b}_{\alpha })/s.d.(b_{\alpha }^{(\scriptscriptstyle { S })})\) and \(s.d.({\widetilde{b}}_{\alpha })/s.d.(b_{\alpha }^{(\scriptscriptstyle { S })})\). A relative efficiency larger (smaller) than 1 indicates that \({b}_{\alpha }\) is less (more) stable than \(b_{\alpha }^{(\scriptscriptstyle { S })}\). We also compute the decile ranges of \({b}_{\alpha }/b_{\alpha }^{(\scriptscriptstyle { S })}\) and \({\widetilde{b}}_{\alpha }/b_{\alpha }^{(\scriptscriptstyle { S })}\), which assess the variation of \({b}_{\alpha }\) and \({\widetilde{b}}_{\alpha }\) with respect to \(b_{\alpha }^{(\scriptscriptstyle { S })}\). These ranges reveal that the empirical likelihood bounds are often lower than, and approximately proportional to, \(b_{\alpha }^{(\scriptscriptstyle { S })}\).
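These summaries can be computed from the replicated bounds as in the sketch below (a hypothetical helper illustrating the definitions, not code from the study).

```python
import numpy as np

def summarise(bounds, stringer, Y_N, Y_hats):
    # Monte Carlo summaries reported in the tables for a given bound.
    bounds, stringer = np.asarray(bounds, float), np.asarray(stringer, float)
    Y_095 = np.quantile(Y_hats, 0.95)                 # 95% quantile of the estimates
    return {
        "coverage": np.mean(bounds >= Y_N),           # proportion covering Y_N
        "mean_ratio": bounds.mean() / Y_095,          # Mean(b) / Y_0.95
        "sd_ratio": bounds.std(ddof=1) / stringer.std(ddof=1),
        "deciles": np.quantile(bounds / stringer, [0.1, 0.9]),
    }
```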

Table 1 Positive taints \(\sim U(t_{L},t_{U})\) and \(N = 10{,}000\)
Table 2 \(t_i\sim U(t_{L},t_{U})\) and \(N=700\) or 1000

In Table 1, the sampling fraction n/N is small. The coverage of \(b_{\alpha }^{(\scriptscriptstyle {{{\mathcal {N}}}})}\) is significantly smaller than \(95\%\) and \(b_{\alpha }^{(\scriptscriptstyle { S })}\) gives large coverages. On average, \({b}_{\alpha }\) is slightly larger than \(b_{\alpha }^{(\scriptscriptstyle {{{\mathcal {N}}}})}\) and smaller than \(b_{\alpha }^{(\scriptscriptstyle { S })}\). The bounds \({b}_{\alpha }\) and \(b_{\alpha }^{(\scriptscriptstyle { S })}\) have similar standard deviations; with \(n=100\), the standard deviation of \({b}_{\alpha }\) is slightly larger. Some coverages of \({b}_{\alpha }\) may be significantly different from \(95\%\). With \(n=100\) and \(r= 2\%\), about \(13\%\) of the samples contain only zero values for \(y_i\); those samples have been ignored when computing the coverages. With \(n=500\) and \(r=10\%\), we observe large coverages significantly different from \(95\%\), because in this case \(n/N=0.05\) is small but not negligible, leading to a more conservative bound \({b}_{\alpha }\). With \(n/N=0.05\), the bound \({\widetilde{b}}_{\alpha }\) is more suitable and should have better coverage.

Table 3 \(t_i\sim U(t_{L},t_{U})\) if \(z_i> {Z}_{1-r}\) and \(t_i=0\) otherwise

In Table 2, we consider \(n=200\) with \(N=1000\) or 700; that is, n/N is not negligible. As before, the bound \(b_{\alpha }^{(\scriptscriptstyle {{{\mathcal {N}}}})}\) has a low coverage and \(b_{\alpha }^{(\scriptscriptstyle { S })}\) has a large coverage. The bound \({b}_{\alpha }\) is too conservative, with \(100\%\) coverage, but it is usually smaller than \(b_{\alpha }^{(\scriptscriptstyle { S })}\), because Mean\(({b}_{\alpha })\leqslant \) Mean\((b_{\alpha }^{(\scriptscriptstyle { S })})\). The coverages of \({\widetilde{b}}_{\alpha }\) are closer to the nominal value, because the effect of Hájek’s (1964) corrections \({q}_{i}\) is more pronounced than with \(N= 10\,000\) in Table 1; however, some coverages are still significantly different from \(95\%\). The bound \({\widetilde{b}}_{\alpha }\) is smaller than \({b}_{\alpha }\). It is also mostly smaller than \(b_{\alpha }^{(\scriptscriptstyle { S })}\), because the upper deciles are less than 1, and it is more stable than \(b_{\alpha }^{(\scriptscriptstyle { S })}\), because we observe a smaller s.d. for \({\widetilde{b}}_{\alpha }\). With \(N=1000\), the bound \({b}_{\alpha }\) is only slightly more stable than \(b_{\alpha }^{(\scriptscriptstyle { S })}\).

For the next series of simulations, we consider the situation where the errors are only in the right tail, which could be the case with fraudulent behaviour. Let \({Z}_{1-r}\) denote the \(1-r\) quantile of the \(z_i\) generated from (12), where \(r\) denotes the error rate. We generate \(t_i\) randomly from uniform distributions \(\text{ Un }(t_{L},t_{U})\) when \(z_i> {Z}_{1-r}\). If \(z_i\leqslant {Z}_{1-r}\), we set \(t_i=0\). We consider \(r=2\%,\,5\%\) and \(10\%\). The ranges are \([t_{L},t_{U}]=[0.1,0.3],\,[0.2,0.7]\) and [0.5, 0.7]. Since the errors are in the right tail, we expect a strong correlation between \({\pi }_{i}\) and \(y_i\). The results are given in Table 3. The coverages of \({b}_{\alpha }\) are larger than those of \({\widetilde{b}}_{\alpha }\), and we usually observe coverages closer to \(95\%\) with \({\widetilde{b}}_{\alpha }\). The Stringer bound \(b_{\alpha }^{(\scriptscriptstyle { S })}\) has a very large coverage and is usually larger than \({\widetilde{b}}_{\alpha }\). The coverage of \(b_{\alpha }^{(\scriptscriptstyle {{{\mathcal {N}}}})}\) is smaller than that of \({\widetilde{b}}_{\alpha }\) when \(n=100\); most coverages of \(b_{\alpha }^{(\scriptscriptstyle {{{\mathcal {N}}}})}\) are not significantly different from \(95\%\) when \(n>100\). The bound \({b}_{\alpha }\) is more conservative than \({\widetilde{b}}_{\alpha }\), even when n/N is negligible, for the following reasons. Some \({\pi }_{i}\) can be large even when n/N is negligible; thus some \({q}_{i}\) can be very different from 1 for the units that are more likely to be selected. Furthermore, the correlation between \(y_i\) and \({\pi }_{i}\) makes \({\widetilde{b}}_{\alpha }\) less conservative, because the self-normalising property of \({\widetilde{\ell }}({Y})\) implies that \({\widetilde{\ell }}({Y})\) can be approximated by a quadratic form (Berger and Torres 2016) involving a small variance, due to the \({q}_{i}\) and the correlation between \(y_i\) and \({\pi }_{i}\) (e.g. Rao 1966). The bound \({\widetilde{b}}_{\alpha }\) seems to be the most appropriate.

Table 4 \((1-\gamma )r\%\) taints generated from a \(\text{ Beta }(2,5)\) distribution and \(\gamma r\%\) taints are equal to 1
Table 5 \(r\) denotes the % of \(t_i>0\), and \(\nu \) represents the % of \(t_i<0\) generated from a \(\text{ Beta }(2,5)\) distribution

Now, we consider the situation where we have 100% overstatement for some items; that is, we allow \(t_i=1\) for some i. We also consider a right-skewed beta distribution for \(0<t_i<1\). We set \((100-r)\%\) of the \(t_i\) equal to zero and generate \((1-\gamma )r\%\) of the taints randomly from a \(\text{ Beta }(2,5)\) distribution; the remaining \(\gamma r\%\) of the taints are equal to one. We consider \(N=10{,}000\), with \(n=100,\ 200\) and 500. The error rates are \(r=2\%,\,5\%\) and \(10\%\). The fraction \(\gamma \) of taints equal to one among the \(t_i>0\) is \(\gamma =10\%,\,20\%\) or \(40\%\). Systematic sampling is used, with probability proportional to \(z_i\). The results are given in Table 4. We again observe low coverages for \(b_{\alpha }^{(\scriptscriptstyle {{{\mathcal {N}}}})}\) and a large coverage for \(b_{\alpha }^{(\scriptscriptstyle { S })}\).
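A sketch of this taint generator, under our reading of the setup (with \(r\) and \(\gamma \) passed as proportions rather than percentages; the helper is hypothetical):

```python
import numpy as np

def beta_taints(N, r, gamma, rng):
    # (1 - r)N zero taints, gamma*r*N taints equal to one (100% overstatements)
    # and the remaining (1 - gamma)*r*N taints drawn from Beta(2, 5).
    t = np.zeros(N)
    err = rng.choice(N, size=int(round(r * N)), replace=False)
    n_ones = int(round(gamma * err.size))
    t[err[:n_ones]] = 1.0
    t[err[n_ones:]] = rng.beta(2.0, 5.0, size=err.size - n_ones)
    return t
```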

By comparing Table 4 with Table 1, we see that the coverage of \({b}_{\alpha }\) is lower with \(n=100\), when \(r=2\%\) or \(5\%\). In these situations, the bound \({b}_{\alpha }\) has a larger s.d. than \(b_{\alpha }^{(\scriptscriptstyle { S })}\). We have Mean\((b_{\alpha }^{(\scriptscriptstyle { S })})/{{Y}}_{0.95}<2\), whereas in Table 1 this ratio can be larger than 2 for \(r=2\%\). With \(r=10\%\), the coverage of \({b}_{\alpha }\) is the closest to \(95\%\). The fraction \(\gamma \) of \(t_i=1\) does not seem to affect the precision or the coverage of \({b}_{\alpha }\).

For the last series of simulations, we consider understatements; that is, negative taints. We follow approximately Clayton and McMullen’s (2007) simulation setup. Now, \(r\) denotes the fraction of \(t_i>0\), and \(\nu \) represents the fraction of \(t_i<0\), which are given by \(t_i=-a_i\), with \(a_i\) generated randomly from a \(\text{ Beta }(2,5)\) distribution. The fraction of \(t_i=0\) is \((100-r-\nu )\%\). We have \((1-\gamma )r\%\) taints between 0 and 1, following a \(\text{ Beta }(2,5)\) distribution, and the fraction of \(t_i=1\) is \(\gamma r\%\). We consider \(\gamma =20\%\), \(N=10{,}000\) and \(n=200\). The fraction of \(t_i>0\) is \(r=2\%,\,5\%\) or \(10\%\), and the fraction of \(t_i<0\) is \(\nu =2\%,\,5\%\) or \(10\%\). Systematic sampling is used, with probability proportional to \(z_i\). The results are given in Table 5.

For Tables 4 and 5, the positive taints are generated in the same way, with \(\gamma =20\%\). The differences observed between Tables 4 and 5 can therefore be attributed to the negative taints. We notice that the coverage of the Stringer offset bound \(b_{\alpha }^{(\scriptscriptstyle { S }\scriptscriptstyle { O })}\) can be lower than \(95\%\) and decreases with \(\nu \): the offset reduces the bound, and this effect is more pronounced for large \(\nu \). We observe large coverages for \(b_{\alpha }^{(\scriptscriptstyle {{{\mathcal {N}}}})}\). The coverages of \({b}_{\alpha }\) are the closest to \(95\%\) in all cases. We observed lower coverages in Table 4 for \(n=100\), because the distribution of the taints is more skewed than in Table 5. Note that we have a smaller s.d. for \({b}_{\alpha }\) compared to \(b_{\alpha }^{(\scriptscriptstyle { S }\scriptscriptstyle { O })}\). The large values of Mean\((b)/{{Y}}_{0.95}\) observed in Table 5 are due to the fact that \({{Y}}_{0.95}\) can be close to zero.

In Table 2, the number of units with \({\pi }_{i}=1\) is 46 with \(N=700\) and 33 with \(N=1000\). With \(N=10\,000\) and \(n=500\), we only have 5 units with \({\pi }_{i}=1\) (Tables 1, 3, 4 and 5). For \(n=100\) or 200 and \(N=10\,000\), we have \({\pi }_{i}<1\) for all i. These numbers are the same for the different distributions of taints, because we use the same \(z_i\) generated by (12) for a given N. We expect \({\widetilde{b}}_{\alpha }\) to be noticeably lower than \({b}_{\alpha }\) when the number of units with \({\pi }_{i}=1\) is large. This is what we observe in Table 2.

5 Conclusions

Our simulation study confirms that the naïve bound based on the central limit theorem can be too small, with an observed coverage significantly lower than the nominal level. On the other hand, the Stringer bound is too conservative, with a coverage close to \(100\%\), unless we have understatements. The proposed empirical likelihood bounds have coverages close to the nominal level and usually lie between the naïve bound and the Stringer bound. The penalised empirical likelihood bound described in Sect. 3.2 seems to be the most appropriate, because it takes into account the sampling fraction and a possible correlation between the errors and the selection probabilities. For example, when the errors are mainly within the tail of the recorded values, better bounds are obtained with the penalised empirical likelihood approach, even with small n/N. We recommend using the penalised empirical likelihood bound, because it has better observed coverage and may be more stable than the Stringer bound.

Both empirical likelihood bounds have the advantage of respecting the confidence level and of being less conservative than the Stringer bounds. However, they are more numerically intensive than the Stringer bound, because they rely on a Lagrangian parameter. The approach based on p-values is a simpler alternative for checking whether the total error exceeds a tolerable error amount. We need to compute \(\ell ({{\mathcal {A}}})\), which requires solving (5) with a root-search method to obtain the value of \({\varvec{\eta }}\) for \({Y}={{\mathcal {A}}}\). Once \({\varvec{\eta }}\) is known, \(\ell ({{\mathcal {A}}})\) can be computed from (6). The analogue to (5) for the penalised version of Sect. 3.2 can be easily derived. Computing the bound \({b}_{\alpha }\) (or \({\widetilde{b}}_{\alpha }\)) is more numerically intensive than the approach based on p-values, because it involves computing \({\varvec{\eta }}\) for different values of \({Y}\) in order to solve (7) (or (10)).

Like the Stringer bound, both empirical likelihood bounds may not be suitable for very small sample sizes. They cannot provide a bound for samples containing no errors, and they can be unstable for samples containing very few errors. Recent empirical likelihood approaches tackle some of these issues (e.g. Chen et al. 2003, 2008; Jing et al. 2017), but they are not designed to handle the unequal probability sampling used in MUS. It would be useful to investigate how these extensions can be used under MUS.