Abstract
Optimal sampling designs for audit, minimizing the mean squared error of the estimated amount of the misstatement, are proposed. They are derived from a general statistical model that describes the error process with the help of available auxiliary information. We show that, if the model is adequate, these optimal designs based on balanced sampling with unequal probabilities are more efficient than monetary unit sampling. We discuss how to implement the optimal designs in practice. Monte Carlo simulations based on audit data from the Swiss hospital billing system confirm the benefits of the proposed method.
Introduction
We consider audit plans that are periodically repeated. The specific case that we have in mind is “hospital bill audit”; however, other types of audit, e.g., periodic internal controls of tax assessment of a governmental revenue service, share similar features. Past audit experience provides the auditor with useful information to assess the risk of a misstated transaction in the next audit; this risk usually depends on the transaction type. For example, hospital bills issued at the end of each stay depend on the medical services delivered and empirical evidence shows that the risk of error in an invoice depends on the complexity of these services.
Often, because examining an entire population is impractical and costly, only a sample of transactions is audited. In this case, audit sampling procedures can provide accurate and unbiased estimates of the total amount of errors in less time and at lower cost than a complete investigation. However, estimation by means of audit sampling implies a sampling risk due to the random selection of transactions. Different methods of controlling this risk have been proposed. Current auditing practice still uses simple random sampling. A more efficient approach is ranked set sampling (Gemayel et al. 2011). However, the most relevant methodology, endorsed by the American Institute of Certified Public Accountants (AICPA 2008), is monetary unit sampling (MUS). In MUS, also known as dollar unit sampling, the probability that a transaction is included in the sample is proportional to the value of the main variable that has to be checked by the auditor, e.g., the amount of the transaction. Thus, large transactions are overrepresented in the sample and, for this reason, a “weighted estimate” of the total error amount has to be used in the audit report. MUS is usually attributed to Stringer (1963) and is described, among others, in Leslie et al. (1980), Gafford and Carmichael (1984), Tsui et al. (1985), Smieliauskas (1986), Grimlund and Felix (1987), and Hansen (1993). MUS is an application to accounting of systematic sampling with unequal inclusion probabilities, as proposed by Madow (1949).
In this paper, we show that a formal statistical evaluation of past experience can be used to optimize the sampling design. More precisely, we develop sampling designs that optimize the accuracy of the total error estimate. Therefore, an optimal design requires a smaller sample size than any other design in order to attain the same level of accuracy.
In order to develop the optimal designs, we assume that the error generating process can be formally modeled. We consider a two-stage model: the first stage describes the probability of misstatement as a function of the available auxiliary covariables; the second stage describes the relationship between the error amount and the covariables. Since large errors are usually associated with large transactions and small errors with small transactions, the second stage model is heteroscedastic. It turns out that, as with MUS, the optimal sampling strategy overselects large transactions. However, the inclusion probabilities are different from those of MUS because they take the auxiliary information into account. The use of auxiliary information has already been recommended in the audit literature with the purpose of improving the estimates (see e.g., Kaplan 1973) and the sampling designs (Hoogduin et al. 2015).
The first part of the paper defines the model of the error generating process and two estimators of the total amount of errors. Then, the optimal designs are introduced. We also introduce a simple simulated example of hospital bill audit with artificial data and refer to it throughout in order to illustrate the concepts, the models, the theoretical results, and their application. The second part of the paper outlines the general implementation of the proposed sampling strategies in practice. The implementation is illustrated with the help of real data from a Swiss hospital bill audit program. In Sect. 8, we comment on the general applicability of the models and conclude that, if the audit error generating process can be modeled with the help of auxiliary information, better sampling designs than MUS can be implemented. An “Appendix” provides all mathematical proofs and complete theoretical results. It also describes a large Monte Carlo simulation that confirms the theoretical results. The computer programs and the data used in the example can be obtained from the authors.
The statistical model
In auditing, a set of transactions (accounts, bills, records) is controlled by the auditor. Let k be the label of the kth transaction and consider the reference population \(U=\{1,\dots ,k,\dots ,N\}\), i.e., the set of all transactions to be checked. Let \(x_k\) be the written value (monetary amount) of transaction k that must be controlled by the auditor and let \(y_k\) be the corresponding audit amount. The audit amount is only known after the control.
Hospital bill audit example: a short introduction In modern hospital management, case billing is based on a patient classification system (PCS), i.e., a set of rules which ascribe each individual stay to a group according to the patient’s characteristics. Groups, usually called Diagnosis-Related Groups or DRGs (Fetter et al. 1980), depend on diagnoses and treatments and are as homogeneous as possible with respect to resource consumption. A standard price of stay is associated with each group and reported on the patient invoice at the end of the stay. The standard prices are usually fixed yearly on the grounds of national statistics on the observed stays. The quality of the data is clearly a crucial feature of the system and audit plans have been introduced by national health services in order to determine whether the data in the health records correctly document the services listed on the invoices (see e.g., AAMAS 2009). Many audit companies listed on the Internet provide “medical bill auditing services”. The main purpose is to detect errors and fraud: errors can be accidental (i.e., due to difficulties in coding complicated cases) or intentional (i.e., “overcoding” of inadequate procedures with the purpose of inflating the bill). Both positive and negative differences between written and audit values are important: positive differences are a concern for insurance companies because they represent incorrect claims; negative differences worry the hospital because they are losses in revenue.
Hospital bill audit example: artificial data In this simplified example, we suppose that the auditor is required to audit a hospital H with a population U of \(N=75\) patients and that a hypothetical exhaustive control of the invoices would bring out 27 “errors”. Such a very small population is clearly not typical of real hospital bill audits. Our purpose, however, is to show all the data and the main computations on the single-page sheet of Table 1. Column k indicates the case number, x the amount written on the invoice, and y the audit amount (in some monetary units). Column d contains the differences \(x_k-y_k\) and column t the ratios \(t_k=(x_k-y_k)/x_k\). In certain audit applications, \(t_k\) is called the “taint” of transaction k. In Fig. 1 (left panel), a plot of y against x shows that both the amount and the variability of the differences \(x_{k}-y_{k}\) are roughly proportional to the transaction values, i.e., the relationship between the errors and the values is heteroscedastic. This is a typical statistical feature of audit data. The remaining columns of the sheet will be described below.
We will consider two general statistical models for the error generating process. Both of these models consist of two stages: the first stage is a logistic model that describes how the probability of occurrence of an error depends on the available information; the second stage describes the relationship of the error amount with the available information. The first stage is the same for both models. At the second stage, if there is an error, the amount of this error is proportional to the written value and follows a linear heteroscedastic model. This feature of the data is common to many audit populations. It is the standard statistical justification of stratified sampling and sampling with inclusion probabilities proportional to the written values such as MUS. In the accounting literature, it has also been discussed in relation to the process of making comparative judgments of numerical information; see, for example, Dickhaut and Eggleton (1975), who investigate the role of the Weber–Fechner law in predicting materiality judgments. Rosenberg et al. (2000) propose a statistical model to detect DRG overcoding, which has common features with our model.
We assume that the auxiliary information about the transactions is available in the form of two vectors of covariables \({\mathbf{v}}_k\in {\mathbb {R}}^q\) and \({\mathbf{z}}_k\in {\mathbb {R}}^p\). Vector \({\mathbf{v}}_k\) contains the values of q covariables that are used to describe the occurrence of an error in transaction k. Vector \({\mathbf{z}}_k\) contains the values of p covariables that describe—in case of error—the amount of the error in transaction k. The covariables can be quantitative or qualitative. In hospital bill audits, the covariables usually describe medical complexity; in financial accounting, the covariables may describe different aspects of the transactions such as manufacturer, provider, type of product delivered, customer, or seller characteristics. According to the usual practice in the statistical literature, the notations x, y, \({\mathbf{v}}\), \({\mathbf{z}}\), etc., without the suffix k will be used for the variable names.
First stage model M0
The occurrence of the errors is described with the help of N independent binary variables \(j_k\) such that
$$j_k=\begin{cases} 1 & \text{with probability } \psi _k,\\ 0 & \text{with probability } 1-\psi _k, \end{cases}$$
where
$$\psi _k=\frac{\exp ({\mathbf {v}}_k^{\rm T}\gamma )}{1+\exp ({\mathbf {v}}_k^{\rm T}\gamma )}. \qquad (1)$$
This logistic model is interpreted as follows: if \(j_k=1\), the written value \(x_k\) is incorrect, otherwise it is correct. The probability that \(x_k\) is incorrect is \(\psi _k\). The vector \(\gamma \in {\mathbb {R}}^q\) contains the “logistic regression coefficients”. Since \(j_k\) is a Bernoulli variable, its expected value under the model is \(\mathrm{E}_M(j_k)=\psi _k\), and its variance under the model is \(\mathrm{Var}_M(j_k)=\psi _k(1-\psi _k)\). An equivalent expression of (1) is
$$\text{logit}\, \psi _k={\mathbf {v}}_k^{\rm T}\gamma ,$$
where \(\text{logit}\, \psi _k = \log \left( { \psi _k }/{(1- \psi _k)}\right)\).
Second stage model M1
We assume that, if there is an error in \(x_k\), the amount of the error is proportional to the written amount, i.e.,
$$y_k=x_k-j_k\,x_k\left( {\mathbf {z}}_k^{\rm T}\beta +\varepsilon _k\right),$$
where \(j_k\) is defined in model M0. The variables \(\varepsilon _k\) are independent and such that \(\mathrm{E}_M(\varepsilon _k)=0\) and \(\mathrm{Var}_M(\varepsilon _k)=\sigma ^2\). Moreover, \(j_k\) and \(\varepsilon _k\) are independent. The vector \(\beta \in {\mathbb {R}}^p\) contains the “linear regression coefficients”. The model can be interpreted as follows. If \(j_k=0\), the value \(x_k\) is correct and \(y_k=x_k\). If \(j_k=1\), the “error” \(y_k-x_k\) is proportional to the written amount \(x_k\). We say that M1 is a heteroscedastic model because, given an error, the variance of \(y_k\) is \(\sigma ^2x_k^2\) and thus increases with \(x_k\). According to M1, the taints \(t_{k}=(x_{k}-y_{k})/x_{k}\) satisfy
$$t_k=j_k\left( {\mathbf {z}}_k^{\rm T}\beta +\varepsilon _k\right).$$
If a sample of transactions is made available by previous audits, M0 and M1 can be estimated separately. To estimate M1, only the incorrect transactions are used.
Hospital bill audit example: the model We assume that past audits of hospital H provided the following information.

(a)
The relation between y and x is of the heteroscedastic type shown in Fig. 1.

(b)
It is possible to ascribe the 75 invoices to three groups according to the medical complexity of the services delivered: “1: simple”, “2: average”, “3: high”. More precisely, invoices 1 to 20 are in group 1, invoices 21 to 50 in group 2, invoices 51 to 75 in group 3. The probability \(\psi _{k}\) that an invoice will have to be corrected is \(10\,\%\) in group 1, \(30\,\%\) in group 2, \(50\,\%\) in group 3.

(c)
The values of \(t_{k}\) are also group dependent. Among the misstatements, their expected values are \(-0.025\) in group 1, 0.05 in group 2, and 0.20 in group 3.
The most common statistical model to describe the relationship between probabilities and a set of covariables is a logistic model (see, e.g., Hosmer and Lemeshow 2013) and we now show that the probabilities mentioned in (b) can be described by formula (1). For this purpose, we define the three (\(q=3\)) explanatory variables
$$v_{k,1}=I(k\in \text{group 1}),\quad v_{k,2}=I(k\in \text{group 2}),\quad v_{k,3}=I(k\in \text{group 3}),$$
where \(k=1,\ldots ,75\) and I denotes the “indicator function”: for example, \(I(k\in \text{group 1})=1\) if k is in group 1 and \(=0\) otherwise. These variables, also called “indicator variables”, are used to represent the qualitative information “group belonging”. Using vector notations \({\mathbf {v}}_{k}^{\rm T}=(v_{k,1},v_{k,2},v_{k,3})\) and \(\gamma =(\gamma _{1},\gamma _{2},\gamma _{3})^{\rm T}\) we have
$${\mathbf {v}}_{k}^{\rm T}\gamma =\gamma _{1}v_{k,1}+\gamma _{2}v_{k,2}+\gamma _{3}v_{k,3},$$
and using \(\gamma _{1}=\log (0.1/0.9)\), \(\gamma _{2}=\log (0.3/0.7)\), \(\gamma _{3}=\log (0.5/0.5)\), formula (1) is equivalent to \(\psi _{k}=0.1\) for k in group 1, \(\psi _{k}=0.3\) for k in group 2, and \(\psi _{k}=0.5\) for k in group 3.
Usually, \(\gamma\) is estimated from available data with the help of a logistic regression program. In this simple example, the probabilities \(\psi _k\) can be directly estimated with the frequencies of mistakes observed in previous audits.
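As a quick numerical check of the example, the following Python sketch evaluates the logistic formula with three group indicators and the coefficients quoted above. The function and array names are illustrative assumptions and are not taken from the authors' programs.

```python
import numpy as np

def error_probability(V, gamma):
    """Logistic model M0: psi = exp(v'gamma) / (1 + exp(v'gamma))."""
    eta = V @ gamma
    return 1.0 / (1.0 + np.exp(-eta))

# Group membership of the 75 invoices (1-20, 21-50, 51-75), as in the example.
group = np.repeat([1, 2, 3], [20, 30, 25])

# Indicator covariables: one column per group.
V = np.column_stack([(group == g).astype(float) for g in (1, 2, 3)])

# Coefficients from the text: the logits of 0.1, 0.3 and 0.5.
gamma = np.log(np.array([0.1 / 0.9, 0.3 / 0.7, 0.5 / 0.5]))

psi = error_probability(V, gamma)
print(psi[0], psi[20], psi[50])  # one invoice from each group
```

With indicator covariables the model is saturated, which is why each coefficient is simply the logit of the corresponding group probability.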
Also note that the average values of \(t_k\) mentioned in (c) are given by a simple linear regression model with one explanatory and one dummy variable (\(p=2\)):
$$t_k=\beta _0+\beta _1 z_k+e_k$$
for the misstatements (i.e., \(j_k=1\) in M1). If \(\beta _0=-0.10\), \(\beta _1 = 0.15\), \(z_{k}=0.5\) for k in group 1, \(z_{k}=1\) for k in group 2, \(z_{k}=2\) for k in group 3, and \(E(e_{k})=0\), this model provides the averages \(-0.025\), 0.05, 0.20 for the three groups. We will also assume that the standard deviation \(\sigma\) of \(e_{k}\) has been estimated at 0.10.
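The two stages of the example can be put together in a short simulation. The sketch below is only illustrative: it takes the taint of a misstated invoice to be \(\beta _0+\beta _1z_k+e_k\), uses the parameter values quoted in the example, and additionally assumes normally distributed \(e_k\), which the model itself does not require. The seed and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# Group index (0, 1, 2) of the 75 invoices and the group-level parameters.
group = np.repeat([0, 1, 2], [20, 30, 25])
psi = np.array([0.1, 0.3, 0.5])[group]   # error probabilities (first stage)
z = np.array([0.5, 1.0, 2.0])[group]     # covariable of the second stage
beta0, beta1, sigma = -0.10, 0.15, 0.10

# Expected taint of a misstated invoice: beta0 + beta1 * z.
mean_taint = beta0 + beta1 * z

# First stage: does an error occur?  Second stage: how large is the taint?
j = rng.random(group.size) < psi
e = rng.normal(0.0, sigma, group.size)   # normality is an assumption
t = np.where(j, mean_taint + e, 0.0)

print("expected taints by group:", np.unique(np.round(mean_taint, 3)))
print("simulated number of errors:", int(j.sum()))
```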
It turns out that, in certain applications such as the one described in Sect. 6.1, one observes an approximately linear relationship with homogeneous variance between the logarithms of \(y_k\) and \(x_k\). In these cases, it may be convenient to replace M1 with the following model M2 because the estimation of M2 will be more accurate.
Second stage model M2
In model M2, we assume that, if there is an error, the amount of the error is governed by a linear model on the logarithms of the values:
Vector \(\beta \in {\mathbb {R}}^p\) contains the regression coefficients. The variables \(\varepsilon _{k}\) are assumed to be independent and normally distributed with \(\mathrm{E}_{M}(\varepsilon _{k})=0\), \(\mathrm{Var}_{M}(\varepsilon _{k})=\sigma ^2.\)
We note that formula (2) can also be written as a heteroscedastic model:
where \(\mathrm{E}_M(r_k)=0\), \(\mathrm{Var}_M(r_k)=v_k^2\), and
To estimate M2, only the incorrect transactions in a sample from U are used.
Quantities of interest
In most audits, we are mainly interested in determining the total amount of the misstatements
$$D=\sum _{k\in U}(x_k-y_k)=X-Y,$$
where
$$X=\sum _{k\in U}x_k\quad \text{and}\quad Y=\sum _{k\in U}y_k$$
are the population total written value and total audit value, respectively. We are also interested in the total number of errors defined by
$$J=\sum _{k\in U}j_k.$$
We say that Y, D, and J are “quantities of interest” and that y, \(d=x-y\), and j are “variables of interest”. Before auditing, all the values \(x_k\) are known but the values \(y_{k}\) are unknown. Therefore, Y, D, and J are unknown quantities, whereas X is known. After auditing, some of the values \(y_k\) and \(j_k\), i.e., the audit values, are known. The quantities Y, D, and J remain unknown but they can be estimated using the audit values.
Hospital bill audit example: quantities of interest For the artificial population shown in Table 1, we have \(X=8442.93\), \(Y=8138.27\), \(D=304.66\), and \(J=27\). Before auditing, X is known to the auditor; however Y, D, and J are unknown to the auditor and have to be estimated using a sample of transactions. X and Y, called the original and the revised “casemix”, are important quantities in hospital management.
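For readers who prefer code to notation, the quantities of interest can be computed as follows once all audit values are known. The four transactions below are made up for illustration and are not rows of Table 1.

```python
import numpy as np

# Hypothetical written values x and audit values y for four transactions.
x = np.array([100.0, 250.0, 80.0, 400.0])
y = np.array([100.0, 230.0, 80.0, 410.0])

d = x - y                  # differences (variable of interest d)
j = (d != 0).astype(int)   # error indicators j_k
t = d / x                  # taints t_k

X, Y = x.sum(), y.sum()    # total written value and total audit value
D = d.sum()                # total amount of the misstatements, equal to X - Y
J = j.sum()                # total number of errors

print(X, Y, D, J)
```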
Sampling designs and estimators
A sample S of size n is a subset of n elements of U. To randomly select the sample we use a “sampling design” and an associated sampling algorithm, i.e., a rule for selecting the transactions from the population. A sampling design is a well known concept in the sampling literature (see e.g., Tillé 2006). It specifies the “inclusion probabilities” \(\pi _k=\Pr (k\in S)\) for \(k=1,\ldots ,N\): for each k, \(\pi _k\) is the probability that transaction k is included in the sample S. We assume that \(\pi _k>0\) for all \(k\in U\). For example, in simple random sampling (SRS) without replacement, \(\pi _k=n/N\) for all k.
In general, however, the inclusion probabilities may differ from one another. In MUS, the inclusion probabilities are proportional to \(x_k\), i.e.,
$$\pi _k=\frac{n\,x_k}{X},\quad k\in U.$$
Usually, it is assumed that \(x_k\le X/n\) for all k, so that \(\pi _k\le 1\). Note that, if this condition is not fulfilled, it is still possible to compute inclusion probabilities proportional to \(x_k\), where the largest units have a probability equal to 1 (see e.g., Tillé 2006, 18–19). Algorithms to draw random samples according to SRS or MUS can easily be implemented in common spreadsheets.
Hospital bill audit example: samples We suppose that, in order to minimize resources, a sampling audit based on a sample of size \(n=30\) is considered. Column p1 in Table 1 contains the inclusion probabilities of an SRS design and column p2 contains the inclusion probabilities of a MUS design. Columns s1 and s2 define two specific samples drawn according to SRS and MUS. Units such that \(s1=1\) are taken into the SRS sample and units such that \(s2=1\) are taken into the MUS sample. The remaining units (\(s1=0\) and \(s2=0\)) do not belong to any sample.
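The computation of inclusion probabilities proportional to \(x_k\), including the case where some units exceed \(X/n\), can be sketched as follows. This is a minimal Python illustration of the usual iterative rule (units whose probability would exceed 1 are taken with certainty and the remaining probabilities are rescaled); the function name is an assumption and the code is not the authors' implementation.

```python
import numpy as np

def mus_inclusion_probabilities(x, n):
    """Inclusion probabilities proportional to x that sum to n and are capped at 1."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("written values must be positive")
    if not 0 < n <= x.size:
        raise ValueError("n must be between 1 and the population size")
    pi = np.zeros_like(x)
    certain = np.zeros(x.size, dtype=bool)   # units taken with probability 1
    while True:
        free = ~certain
        remaining = n - certain.sum()        # sample size left for the free units
        pi[free] = remaining * x[free] / x[free].sum()
        pi[certain] = 1.0
        over = free & (pi > 1.0)
        if not over.any():
            return pi
        certain |= over                      # cap these units and rescale the rest

# One very large transaction: it is selected with certainty.
pi_capped = mus_inclusion_probabilities([10, 20, 30, 40, 900], n=2)
# No transaction exceeds X/n: plain proportionality n * x / X.
pi_plain = mus_inclusion_probabilities([100, 200, 300, 400], n=2)
print(pi_capped, pi_plain)
```

In both cases the probabilities sum to the sample size n, which is what a fixed-size design requires.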
Estimators
When the inclusion probabilities are unequal, estimators must be adjusted for over- and underrepresented sampled transactions in order to remove the bias. A well-known unbiased estimator of Y is the Horvitz–Thompson (HT) estimator (Horvitz and Thompson 1952). The HT estimator of Y is the sample weighted sum
$$\hat{Y}=\sum _{k\in S}\frac{y_k}{\pi _k}.$$
Under SRS, this estimator coincides with the familiar expression \(N\sum _{k\in S} y_k/n\). There are several possibilities to estimate D. The two main estimators are the “direct estimator”
$$\hat{D}_1=X-\hat{Y}$$
and the “difference estimator”
$$\hat{D}_2=\hat{X}-\hat{Y}=\sum _{k\in S}\frac{x_k-y_k}{\pi _k},$$
where \(\hat{X} = \sum _{k\in S} x_k/\pi _k\) is the HT estimator of X. Note that, although X is known, it can be estimated using the sampled book values. A very nice feature of probabilities proportional to \(x_k\) is that
$$\hat{X}=\sum _{k\in S}\frac{x_k}{\pi _k}=\sum _{k\in S}\frac{X}{n}=X,$$
i.e., the estimator coincides with the population value. When \(\hat{X}=X\), we say that “the sample is balanced on x”. Therefore, the MUS design is balanced on x and \(\hat{D}_1=\hat{D}_2\) under MUS. Finally, the HT estimator of J is
$$\hat{J}=\sum _{k\in S}\frac{j_k}{\pi _k}.$$
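The estimators can be illustrated with a few lines of Python. The sketch below uses a made-up population of four transactions and one arbitrary sample of size 2; it computes the HT estimate of X, the direct estimate (X minus the HT estimate of Y) and the difference estimate (the HT total of the sampled differences), and shows that the last two agree when the inclusion probabilities are proportional to x. All names and numbers are illustrative assumptions.

```python
import numpy as np

def ht_total(values, pi, sample):
    """Horvitz-Thompson estimate of a total: sum over the sample of value / pi."""
    return float(np.sum(values[sample] / pi[sample]))

def error_estimates(x, y, pi, sample):
    """Return (X_hat, D1_hat, D2_hat); only the sampled entries of y are used."""
    X = float(x.sum())                 # known total written value
    X_hat = ht_total(x, pi, sample)
    Y_hat = ht_total(y, pi, sample)
    d1 = X - Y_hat                     # direct estimator
    d2 = X_hat - Y_hat                 # difference estimator
    return X_hat, d1, d2

# Hypothetical population and MUS-type inclusion probabilities for n = 2.
x = np.array([100.0, 200.0, 300.0, 400.0])
y = np.array([100.0, 180.0, 300.0, 360.0])
n = 2
pi = n * x / x.sum()

sample = np.array([1, 3])              # indices of the sampled transactions
X_hat, d1, d2 = error_estimates(x, y, pi, sample)
print(X_hat, d1, d2)
```

Because every sampled ratio \(x_k/\pi _k\) equals \(X/n\), the HT estimate of X reproduces X for any sample of size n, which is the balancing property mentioned in the text.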
Hospital bill audit example: estimates We have
Note that \({\hat{X}}_{MUS}=X\) and \(\hat{D}_{1,MUS}=\hat{D}_{2,MUS}\).
Balanced sampling
Consider a quantity of interest, say D, and suppose that the components \(z_1,\ldots ,z_p\) of \({\mathbf{z}}^{\rm T}=(z_{1},\ldots ,z_{p})\) are highly correlated with the corresponding variable of interest, say d. A sample is said to be “balanced on \(z_{1},\ldots ,z_{p}\)” or briefly “balanced on \({\mathbf{z}}\)” if the HT estimators of the component totals equal the population totals, i.e.
$$\sum _{k\in S}\frac{{\mathbf{z}}_k}{\pi _k}=\sum _{k\in U}{\mathbf{z}}_k. \qquad (7)$$
It is known that this kind of sample provides a very accurate HT estimator of the quantity of interest.
Deville and Tillé (2004) have proposed an algorithm, named “the cube method”, that enables the selection of random samples according to assigned inclusion probabilities that are also balanced on one or more variables. They have shown that the balancing Eq. (7) can be approximately satisfied for any set of inclusion probabilities. This algorithm has been made available in the specialized public domain R package sampling described in Tillé and Matei (2012).
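The cube method itself is beyond the scope of a short example, but the balancing condition is easy to check for a given sample. The following Python sketch computes, for each balancing variable, the relative gap between the HT estimate of its total and the true total. The function name and the data are illustrative assumptions.

```python
import numpy as np

def balance_deviation(Z, pi, sample):
    """Relative gap between HT-estimated and true totals, one entry per column of Z."""
    Z = np.asarray(Z, dtype=float)
    pi = np.asarray(pi, dtype=float)
    totals = Z.sum(axis=0)
    ht = (Z[sample] / pi[sample][:, None]).sum(axis=0)
    return (ht - totals) / totals

# Hypothetical population with inclusion probabilities proportional to x (n = 2).
x = np.array([100.0, 200.0, 300.0, 400.0])
psi = np.array([0.1, 0.3, 0.3, 0.5])
pi = 2 * x / x.sum()

# Balancing variables: pi (fixed sample size), x, and x * psi.
Z = np.column_stack([pi, x, x * psi])
sample = np.array([1, 3])

dev = balance_deviation(Z, pi, sample)
print(dev)
```

In this toy case the sample is exactly balanced on \(\pi\) and on x, as for any MUS sample of fixed size, but not on \(x\psi\); reducing such gaps for additional variables is what a balanced design aims at.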
Optimal designs
This section provides optimal designs to estimate the total error amount D under the models introduced above. The designs minimize the anticipated mean squared error (MSE) of the estimator for a fixed sample size. The anticipated \(\mathrm{MSE}\) of an estimator \({\hat{D}}\) is defined as
$$\mathrm{MSE}({\hat{D}})=\mathrm{E}_p\mathrm{E}_M\left( {\hat{D}}-D\right) ^2,$$
i.e., the expected squared difference between the estimator and the quantity D of interest. The expectation is computed both with respect to the sampling design (\(\mathrm{E}_p\)) and the model (\(\mathrm{E}_M\)). We also provide an optimality result for estimating the total number of errors J. The results are obtained using the methodology developed in Nedyalkova and Tillé (2008). Complete mathematical derivations can be found in the “Appendix”.
Result 1: optimal design for model M1
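The sampling design that minimizes the \(\mathrm{MSE}\) of \(\hat{D}_1\) is balanced on x and \(x \psi {\mathbf{z}}\) and has inclusion probabilities proportional to
$$x_{k}\sqrt{\psi _{k}\left[ \sigma ^{2}+\left( {\mathbf {z}}_{k}^{\rm T}\beta \right) ^{2}\left( 1-\psi _{k}\right) \right] }.$$
The sampling design that minimizes the \(\mathrm{MSE}\) of \(\hat{D}_2\) has the same inclusion probabilities; however, it must only be balanced on \(x \psi {\mathbf{z}}\).
Practically, the sample can be selected by means of the cube method in such a way that
$$\sum _{k\in S}\frac{x_k}{\pi _k}=X$$
and
$$\sum _{k\in S}\frac{x_k\psi _k{\mathbf{z}}_k}{\pi _k}=\sum _{k\in U}x_k\psi _k{\mathbf{z}}_k.$$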
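The form of the inclusion probabilities in Result 1 can be motivated by a short moment computation. The sketch below assumes that, under M0 and M1, the difference can be written as \(d_k=x_k-y_k=j_kx_k({\mathbf {z}}_k^{\rm T}\beta +\varepsilon _k)\); the opposite sign convention leads to the same variance. It is only a heuristic and the complete argument is the one given in the “Appendix”.

$$\begin{aligned} \mathrm{E}_M(d_k)&=x_k\,\psi _k\,{\mathbf {z}}_k^{\rm T}\beta ,\\ \mathrm{E}_M(d_k^2)&=x_k^2\,\psi _k\left[ \sigma ^2+\left( {\mathbf {z}}_k^{\rm T}\beta \right) ^2\right] ,\\ \mathrm{Var}_M(d_k)&=\mathrm{E}_M(d_k^2)-\left[ \mathrm{E}_M(d_k)\right] ^2=x_k^2\,\psi _k\left[ \sigma ^2+\left( {\mathbf {z}}_k^{\rm T}\beta \right) ^2\left( 1-\psi _k\right) \right] . \end{aligned}$$

The inclusion probabilities of Result 1 are thus proportional to the model standard deviation of \(d_k\), and the balancing variable \(x\psi {\mathbf{z}}\) is the one that appears in the model expectation of \(d_k\).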
Result 2: optimal design for model M2
The sampling design that minimizes the \(\mathrm{MSE}\) of \(\hat{D}_1\) is balanced on x, \(x\psi\), and \(x\psi a\) and has inclusion probabilities proportional to
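The sampling design that minimizes the \(\mathrm{MSE}\) of \(\hat{D}_2\) has the same inclusion probabilities; however, it must only be balanced on \(x\psi\) and \(x\psi a\).
Practically, the sample can be selected by means of the cube method in such a way that
$$\sum _{k\in S}\frac{x_k}{\pi _k}=X,\quad \sum _{k\in S}\frac{x_k\psi _k}{\pi _k}=\sum _{k\in U}x_k\psi _k,\quad \sum _{k\in S}\frac{x_k\psi _ka_k}{\pi _k}=\sum _{k\in U}x_k\psi _ka_k.$$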
Result 3: optimal design for estimating the number of errors
The sampling design that minimizes the \(\mathrm{MSE}\) of \(\hat{J}\) is balanced on variable \(\psi\) and has inclusion probabilities proportional to \(\sqrt{ \psi _k (1-\psi _k) }\).
From Result 1 it follows that, in general, MUS is not an optimal design because its inclusion probabilities do not take the auxiliary information into account. From Result 3 it follows that the optimal sampling design for estimating J is very different from the optimal designs for \({\hat{D}}_1\) and \({\hat{D}}_2\).
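To make the contrast concrete, the following Python sketch evaluates the size measures of Result 1, \(x_k\sqrt{\psi _k[\sigma ^2+({\mathbf {z}}_k^{\rm T}\beta )^2(1-\psi _k)]}\), and of Result 3, \(\sqrt{\psi _k(1-\psi _k)}\), for three invoices of equal written value, one in each group of the artificial example. The written value of 100 and the function names are assumptions made for illustration.

```python
import numpy as np

def size_measure_m1(x, psi, zbeta, sigma):
    """Size measure of Result 1: x * sqrt(psi * (sigma^2 + (z'beta)^2 * (1 - psi)))."""
    return x * np.sqrt(psi * (sigma**2 + zbeta**2 * (1.0 - psi)))

def size_measure_count(psi):
    """Size measure of Result 3 for estimating the number of errors J."""
    return np.sqrt(psi * (1.0 - psi))

# One invoice per group, using the parameters of the artificial example.
psi = np.array([0.1, 0.3, 0.5])
zbeta = np.array([-0.025, 0.05, 0.20])   # expected taints of misstated invoices
sigma = 0.10
x = np.array([100.0, 100.0, 100.0])      # equal written values (assumption)

size_d = size_measure_m1(x, psi, zbeta, sigma)
size_j = size_measure_count(psi)

# Shares of the total size measure: under MUS all three would be equal.
print(size_d / size_d.sum())
print(size_j / size_j.sum())
```

Although the three invoices have the same written value, and hence the same MUS probability, both optimal designs favor the groups with higher error risk, and they do so to different degrees.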
Calibration
Balancing a sample is an integer problem because each unit is taken or not taken in the sample. The balancing equations can therefore rarely be exactly satisfied and the sample can only be approximately balanced. This complication is called the “rounding problem” in Deville and Tillé (2004). Rounding errors of the balancing procedures may increase the empirical \(\mathrm{MSE}\) of the optimal estimators and cancel out the beneficial effect of the optimal designs.
As a remedy to this shortcoming, Deville and Tillé (2004) recommend using a balanced design followed by “calibration” (e.g., Deville and Särndal 1992) on the same auxiliary variables. For example, when using the optimal design for model M2 and \({\hat{D}}_1\), the ordinary weights used in the HT estimators are the inverse of the inclusion probabilities, i.e., \(f_k=1/\pi _k\). These weights are replaced by the “calibrating weights” \(w_k\) which are computed in such a way that they are as close as possible to the \(f_k\)’s and satisfy the calibrating equations
$$\sum _{k\in S}w_kx_k=X$$
and
$$\sum _{k\in S}w_kx_k\psi _k=\sum _{k\in U}x_k\psi _k,\quad \sum _{k\in S}w_kx_k\psi _ka_k=\sum _{k\in U}x_k\psi _ka_k.$$
The calibrated estimators of Y and D are \(\tilde{Y}=\sum _{k\in S}w_k y_k\), \(\tilde{D}_{1}=X- \tilde{Y}\), and \(\tilde{D}_{2}=\sum _{k\in S}w_k(x_k-y_k)\). Clearly, calibration on x implies \(\tilde{D}_{1}=\tilde{D}_{2}\) and we will call \({\hat{D}}\) their common value. The “closeness” between \(f_k\) and \(w_k\) is defined in Deville and Särndal (1992) by means of a large family of pseudodistances, which define several types of calibration procedures. Calibration algorithms can be found in the R package sampling mentioned above.
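As an illustration of how calibrating weights can be obtained, the sketch below implements linear calibration, i.e., the chi-square distance member of the Deville and Särndal family, which has a closed-form solution. It is not the raking ratio variant and not the authors' code; the sample values and population totals are made up.

```python
import numpy as np

def linear_calibration(f, A, totals):
    """Weights w_k = f_k * (1 + a_k' lam) such that sum_k w_k a_k equals totals.

    f      : initial weights 1 / pi_k of the sampled units
    A      : matrix with one row a_k of calibration variables per sampled unit
    totals : known population totals of the calibration variables
    """
    f = np.asarray(f, dtype=float)
    A = np.asarray(A, dtype=float)
    totals = np.asarray(totals, dtype=float)
    M = (A * f[:, None]).T @ A                 # sum_k f_k a_k a_k'
    lam = np.linalg.solve(M, totals - f @ A)   # Lagrange multipliers
    return f * (1.0 + A @ lam)

# Hypothetical sample of four transactions, calibrated on x and x * psi.
x_s = np.array([100.0, 200.0, 300.0, 400.0])
psi_s = np.array([0.1, 0.3, 0.3, 0.5])
f = 1.0 / np.array([0.1, 0.2, 0.3, 0.4])       # inverse inclusion probabilities
A = np.column_stack([x_s, x_s * psi_s])
totals = np.array([4000.0, 1150.0])            # assumed population totals

w = linear_calibration(f, A, totals)
print(w, w @ A)
```

Linear calibration can produce negative weights when the sample is far from balanced; distance functions such as raking ratio avoid this at the price of an iterative solution.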
Hospital bill audit example: optimal sampling design Condition (a) mentioned at the beginning of the example suggests the use of a sampling design with inclusion probabilities proportional to \(x_k\), i.e., of MUS. However, we will show that, using information (b) and (c), we can improve the accuracy of the calibrated estimate \(\hat{D}\) of the total error amount D, both w.r.t. SRS and MUS. Since the example uses model M1 the optimal inclusion probabilities \(\pi _{k}\) are proportional to \(x_{k}\sqrt{\psi _{k}\left[ \sigma ^{2}+\left( z_{k}\beta \right) ^{2}\left( 1-\psi _{k}\right) \right] }\). The \(\pi _k\) values can be found in Table 1, column p3 and are plotted against \(x_k\) in Fig. 1 (right panel). They are proportional to \(x_k\), but the strength of the proportionality depends on the error probability. In addition, the sample is balanced and calibrated on x and \(x\psi z\). Balancing and calibration have been obtained with the help of the R package sampling mentioned above. The sample is characterized by column s3 in Table 1. We obtained the following estimates:
Hospital bill audit example: a Monte Carlo simulation We used a Monte Carlo simulation to compare the following sampling strategies:

cSRS: simple random sampling without replacement, calibrated on x,

OPTD: optimal design balanced and calibrated on x and \(x\psi z\),

MUS: naturally balanced on x,

iMUS: improved MUS balanced on \(x\psi\) and \(x\psi z\), calibrated on x, \(x\psi\) and \(x\psi z.\)
For each design we simulated 5,000 samples of size 30 drawn from the invoice population U. For each sample, we computed the estimate \(\hat{D}\) (which coincides with \({\hat{D}}_1\) and \({\hat{D}}_2\)). OPTD was based on Result 1. Calibration was computed according to the raking ratio technique (Deville and Särndal 1992). Table 2 reports the empirical \(\mathrm{MSE}\) (mse) of \({\hat{D}}={\hat{D}}_1={\hat{D}}_2\). The simulation confirms the theory: OPTD is the best strategy. MUS is better than SRS but can be improved by balancing and calibrating.
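The structure of such a simulation can be sketched in Python. The code below is a simplified illustration, not a reproduction of Table 2: the written values are made up, the taints are generated with normal errors, the sample size is 10 rather than 30 so that no inclusion probability exceeds 1, unequal probability samples are drawn by systematic sampling in random order instead of the cube method, and no calibration is applied. Its output therefore depends on these assumptions.

```python
import numpy as np

rng = np.random.default_rng(2024)  # arbitrary seed

def systematic_sample(pi, rng):
    """Systematic sampling with unequal probabilities, units in random order.

    Assumes 0 < pi_k <= 1 and that sum(pi) is (numerically) an integer n.
    """
    n = int(round(pi.sum()))
    order = rng.permutation(pi.size)
    cum = np.cumsum(pi[order])
    cum[-1] = n                                  # guard against round-off
    points = rng.random() + np.arange(n)         # u, u+1, ..., u+n-1
    return order[np.searchsorted(cum, points, side="right")]

def ht_error_total(x, y, pi, s):
    """Horvitz-Thompson estimate of D from the sampled differences."""
    return np.sum((x[s] - y[s]) / pi[s])

# Hypothetical population: three groups with the parameters of the example.
group = np.repeat([0, 1, 2], [20, 30, 25])
psi = np.array([0.1, 0.3, 0.5])[group]
zbeta = np.array([-0.025, 0.05, 0.20])[group]
sigma, N, n, reps = 0.10, group.size, 10, 2000

x = rng.uniform(100.0, 500.0, N)                 # made-up written values
j = rng.random(N) < psi
taint = np.where(j, zbeta + rng.normal(0.0, sigma, N), 0.0)
y = x * (1.0 - taint)
D = np.sum(x - y)                                # true total error amount

size_opt = x * np.sqrt(psi * (sigma**2 + zbeta**2 * (1.0 - psi)))
designs = {
    "SRS": np.full(N, n / N),
    "MUS": n * x / x.sum(),
    "OPT (unbalanced)": n * size_opt / size_opt.sum(),
}

mse = {}
for name, pi in designs.items():
    assert pi.max() <= 1.0                       # no capping needed here
    if name == "SRS":
        draws = (rng.choice(N, n, replace=False) for _ in range(reps))
    else:
        draws = (systematic_sample(pi, rng) for _ in range(reps))
    est = np.array([ht_error_total(x, y, pi, s) for s in draws])
    mse[name] = np.mean((est - D) ** 2)          # empirical MSE for fixed y

for name, value in mse.items():
    print(f"{name:18s} empirical MSE = {value:10.1f}")
```

Note that this sketch keeps the audit values fixed and averages over samples only, whereas the anticipated MSE of the text also averages over the model.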
Practical implementation
We now describe the necessary steps to implement the optimal procedure for estimating the total error amount using \({\hat{D}}\) according to Result 1, Result 2, and the calibration equations. The procedure to estimate the number of incorrect transactions is implemented in a similar way. The description is very general; in particular circumstances some of the steps can be simplified or even skipped. The next subsection illustrates each step with the help of real data taken from a hospital bill audit. The implementation of the procedure in different industries follows the same steps, assuming that the following conditions are satisfied.
C1: The population of transactions U about which the auditor wishes to draw conclusions is clearly identified and the written values \(x_k\) that must be audited are well defined.

C2: Auxiliary information in the form of q covariables \(v_{1},\ldots ,v_{q}\) that may be related to the probability of an error in x is available for all transactions of U.

C3: Auxiliary information in the form of p covariables \(z_{1},\ldots ,z_{p}\) that may be related to the error amount in x is available for all transactions of U.

C4: The results of a previous sampling audit for the population U are available; they include the sampled values \(x_{k}\), the audit values \(y_{k}\), as well as the values \(v_{1k},\ldots ,v_{qk}\) and \(z_{1k},\ldots ,z_{pk}\) of the covariables for all audit cases.

C5: The model M0-M1 or M0-M2 describes the error generating process. The number of nonzero differences \(x_{k}-y_{k}\) observed in the previous sample is sufficiently large to estimate model M1 or model M2. See Step 2, below.
Condition C1 is a basic requirement of any audit procedure. Conditions C2 and C3 mean that information associated with the risk and the magnitude of an incorrect value of the written value is available for each transaction to be audited. This information is coded in the form of quantitative or qualitative covariables that can be processed in a statistical package. Examples are mentioned at the end of Sect. 2 and in the next subsection. Conditions C3 and C4 state that values of these covariables have been observed in a previous audit of U. As we wrote in the introduction, we are focusing on audits that are regularly repeated and, in practice, the “state” of the population (i.e., the transactions belonging to U) varies from one audit to another. Conditions C3 and C4 mean therefore that data from an audit of a “previous state” of U are available. We suppose, however, that the same statistical model describes the various states of U. The data from the previous audit will then be used to estimate the model and to compute the inclusion probabilities for the next audit.
We finally assume that the size n of the sample to be drawn has been decided (see the note on sample size determination at the end of this section).
The implementation consists of the following steps.
Step 1: Estimate M0. For each transaction k of the previous audit define the indicator \(j_k\) so that \(j_{k}=1\) if there was a misstatement in \(x_k\) (i.e., \(x_{k}\ne y_{k}\)) and \(j_{k}=0\) otherwise. Then, with the help of standard statistical software, fit the logistic regression model M0 with the responses \(j_k\) and explanatory variables \(v_{1},\ldots ,v_{q}\), obtaining the vector of coefficient estimates \({\hat{\gamma }}\).

Step 2: Estimate M1 or M2. First, analyze the subset of transactions of the previous audit such that \(x_{k}\ne y_{k}\) in order to select model M1 or model M2. Usually, graphical representations of \(y_k\) against \(x_k\) and of \(\log (y_k)\) against \(\log (x_k)\) suffice to make this decision. Then, using the data of this subset and standard statistical software, fit the selected regression model with responses \(y_k\) and covariables \(z_{1},\ldots ,z_{p}\), obtaining the vector of coefficient estimates \({\hat{\beta }}\) and the scale estimate \(\hat{\sigma }\).

Step 3: Compute the inclusion probabilities for the next audit sample. Using the vector \({\hat{\gamma }}\) compute, for all transactions to be audited,
$$\hat{\psi }_{k}=\frac{\exp ({\mathbf {v}}_{k}^{\rm T}{\hat{\gamma }})}{1+\exp ({\mathbf {v}}_{k}^{\rm T}{\hat{\gamma }})}.$$
In addition, if model M1 has been selected, use \({\hat{\beta }}\) and \(\hat{\sigma }\) to compute inclusion probabilities \({\hat{\pi }}_k\), which are proportional to
$$x_{k}\sqrt{\hat{\psi } _{k}\left[ \hat{\sigma }^{2} +\left( {\mathbf {z}}_{k}^{\rm T}{\hat{\beta }}\right) ^{2}\left( 1-\hat{\psi }_{k}\right) \right] };$$
otherwise, if model M2 has been selected, compute \(\hat{a}_{k}=\exp ({\mathbf {z}}_{k}^{\rm T} {\hat{\beta }}+\hat{\sigma }^{2}/2)\) and inclusion probabilities \({\hat{\pi }}_k\), which are proportional to
$$x_{k}\sqrt{ \hat{\psi }_k \left[ \exp ({\hat{\sigma }}^2)-1 \right] {\hat{a}}_k+(1-\hat{\psi }_k)({\hat{a}}_k-1)^{2} }.$$
The inclusion probabilities can be computed with the help of specialized software (see e.g., Tillé and Matei 2012).

Step 4: Draw the next audit sample. With the help of the specialized software, draw a random sample of size n from the population U with inclusion probabilities \(\hat{\pi }_{k}\) in such a way that the sample is balanced and calibrated on the variables x, \(x\hat{\psi }\), and \(x\hat{\psi }{\hat{a}}\) (model M2) or x and \(x\hat{\psi }{\mathbf {z}}\) (model M1).
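Steps 1 to 3 can be sketched for model M1 with NumPy alone. The code below is a minimal illustration under several assumptions: the logistic model is fitted by Newton-Raphson, M1 is fitted by ordinary least squares on the taints \((x_k-y_k)/x_k\) of the misstated transactions, the “previous audit” data are simulated, and the same transactions stand in for the next population. Function names are hypothetical; in practice a statistical package would be used, and Step 4 (balanced sampling and calibration) is not covered.

```python
import numpy as np

def fit_logistic(V, j, n_iter=50, tol=1e-10):
    """Step 1: maximum likelihood fit of M0 by Newton-Raphson (no intercept added)."""
    gamma = np.zeros(V.shape[1])
    for _ in range(n_iter):
        psi = 1.0 / (1.0 + np.exp(-(V @ gamma)))
        grad = V.T @ (j - psi)
        hess = (V * (psi * (1.0 - psi))[:, None]).T @ V
        step = np.linalg.solve(hess, grad)
        gamma = gamma + step
        if np.max(np.abs(step)) < tol:
            break
    return gamma

def fit_m1(x, y, Z):
    """Step 2 (M1): least-squares fit of the taints of the misstated transactions."""
    wrong = x != y
    t = (x[wrong] - y[wrong]) / x[wrong]
    beta, *_ = np.linalg.lstsq(Z[wrong], t, rcond=None)
    resid = t - Z[wrong] @ beta
    sigma = np.sqrt(resid @ resid / (wrong.sum() - Z.shape[1]))
    return beta, sigma

def inclusion_probabilities_m1(x, V, Z, gamma, beta, sigma, n):
    """Step 3 (M1): probabilities proportional to the size measure, summing to n."""
    psi = 1.0 / (1.0 + np.exp(-(V @ gamma)))
    size = x * np.sqrt(psi * (sigma**2 + (Z @ beta) ** 2 * (1.0 - psi)))
    return n * size / size.sum()     # values above 1 would need capping

# Simulated "previous audit": 3 groups of 10 invoices with 1, 3 and 5 errors.
rng = np.random.default_rng(0)
group = np.repeat([0, 1, 2], 10)
V = np.eye(3)[group]                               # group indicators
z = np.array([0.5, 1.0, 2.0])[group]
Z = np.column_stack([np.ones(group.size), z])      # constant and z
j_true = np.zeros(group.size)
j_true[[0, 10, 11, 12, 20, 21, 22, 23, 24]] = 1.0
x = rng.uniform(100.0, 500.0, group.size)
t = j_true * (-0.10 + 0.15 * z + rng.normal(0.0, 0.10, group.size))
y = x * (1.0 - t)

j_obs = (x != y).astype(float)                     # indicator used in Step 1
gamma_hat = fit_logistic(V, j_obs)
beta_hat, sigma_hat = fit_m1(x, y, Z)
pi_hat = inclusion_probabilities_m1(x, V, Z, gamma_hat, beta_hat, sigma_hat, n=5)
psi_hat_groups = 1.0 / (1.0 + np.exp(-gamma_hat))
print(psi_hat_groups, beta_hat, sigma_hat, pi_hat.sum())
```

With group indicators as the only covariables, the fitted error probabilities are simply the observed error frequencies of the groups, which gives a convenient check of the logistic fit.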
Illustration with real data
In 2012, a new PCS was introduced in Switzerland, and a mandatory annual hospital bill audit was implemented as a component of a data quality assurance program (SFAO 2014). The audit is based on a sample of invoices. We use a data set of 6043 stays whose records and invoices were checked in a 2011 pilot study. These stays were sampled from 44 participating hospitals (for each hospital, an SRS design was used).
Available auxiliary information (covariables) includes patient characteristics such as diagnosis and treatment codes, the group (DRG) in which the patient was classified, and the hospital to which the patient was admitted. The data and the detailed R scripts of our analysis can be obtained from the authors. In the following paragraphs, we summarize our main findings.
We first consider these data as our “previous sample” to be used in Step 1 and Step 2 of the implementation. The number of different DRG groups in this sample was 657; for 191 groups, at least one invoice was misstated. The total number of incorrect invoices was 317.
In order to build the logistic regression model M0, we first grouped the 44 hospitals into three classes \(\mathscr {H}_{1}\), \(\mathscr {H}_{2}\) and \(\mathscr {H}_{3}\) according to their frequencies f of wrong invoices (\(\mathscr {H}_{1}:f\le 2\,\%\), \(\mathscr {H}_{2}:2\,\%<f\le 12\,\%\), and \(\mathscr {H}_{3}:f>12\,\%\)). The classes were then characterized with the help of two indicator covariables: \(h_{1k}=I(\)stay k is in class \(\mathscr {H}_{1} )\), and \(h_{2k}=I(\)stay k is in class \(\mathscr {H}_{2})\) (so that \(h_{1k}=h_{2k}=0\) if stay k is in class \(\mathscr {H}_{3}\)). Among the 191 DRG groups with wrong invoices, 115 had more than \(10\,\%\) misstatements and were assigned to a class \(\mathscr {G}_1\) characterized with the help of the indicator covariable \(g_{k}=I(\)stay k is in \(\mathscr {G}_1)\). The remaining groups were assigned to a class \(\mathscr {G}_2\). Variables such as length of stay, number of diagnoses, and number of treatments were tested with the help of logistic regression analysis (see e.g. Hosmer and Lemeshow 2013) but were not significant. So, our final estimated model M0 included just three covariables: g, \(h_{1}\), and \(h_{2}\).
The coefficient estimates and their standard errors are reported in Table 3. The probabilities of misstatement shown in Table 4 can be computed according to this model. For example, the probability of error for an invoice k in class \(\mathscr {H}_{2}\) (\(h_{1k}=0\), \(h_{2k}=1\)) and class \(\mathscr {G}_2\) (\(g_{k}=0\)) satisfies (neglecting rounding errors) \(\log [\hat{\psi }_k/(1-\hat{\psi }_k)]=-3.707\), i.e., \(\hat{\psi }_k = \exp ( -3.707 )/(1+\exp (-3.707) )= 0.024\). The probability of a wrong invoice k in class \(\mathscr {H}_{3}\) (\(h_{1k}=h_{2k}=0\)) and class \(\mathscr {G}_1\) (\(g_{k}=1\)) satisfies \(\log [\hat{\psi }_k/(1-\hat{\psi }_k)]=-0.139\), i.e., \(\hat{\psi }_k = \exp ( -0.139 )/(1+\exp (-0.139) )= 0.466\).
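The two probabilities can be checked numerically. The sketch below assumes linear predictors of \(-3.707\) and \(-0.139\), which are the values that reproduce the quoted probabilities up to rounding; the function name `inv_logit` is ours.

```python
import math

def inv_logit(eta):
    """Probability corresponding to a logistic linear predictor eta."""
    return 1.0 / (1.0 + math.exp(-eta))

# Linear predictors for the two classes discussed above (rounded values).
psi_h2_g2 = inv_logit(-3.707)   # class H2 and G2
psi_h3_g1 = inv_logit(-0.139)   # class H3 and G1

print(round(psi_h2_g2, 3), round(psi_h3_g1, 3))
```

Because the predictors are themselves rounded, the second value agrees with the quoted 0.466 only up to rounding in the third decimal.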
We then built a model for the relationship between the audit values \(y_k\), the invoice values \(x_k\), and some available covariables, using the 317 stays such that \(y_{k}\ne x_{k}\). Figure 2 shows the 317 points \((x_{k},y_{k})\) both on the original and on log-log scales. The second plot has a more linear, homoscedastic shape and suggests that model M2 is an adequate description of the relationship between x and y (the approximate normality of the residuals has also been checked). Significant covariables were the length of stay (l), the number of diagnoses (s), and the number of treatments (m); the estimated model also included an interaction term between l and s.
The parameter estimates and their standard errors were obtained by means of a standard multiple regression program and are reported in Table 3. According to this model, the expected value of \(\log (y_{k})\) increases with increasing \(l_k\), \(s_k\), and \(m_k\) as well as with \(\log (x_{k})\) [note that the coefficient of \(\log (x_{k})\) is \(1-0.519= 0.481\)]. These results are not surprising; however, the tiny negative interaction is difficult to interpret and could be removed from the model without markedly affecting the sampling designs and the audit estimators. We note that the HT-estimators remain unbiased even if the model does not exactly fit the population U. Therefore, a sampling design based on an imperfect model can still be a valid design, although it loses some efficiency.
To continue our example and show the performance of the optimal sampling designs, we then took a different point of view, considering the same data set as a “reference population” U of \(N=6043\) transactions for which we knew both the written amounts \(x_k\) and the audit amounts \(y_k\). So, we knew \(J=317\), \(X=\sum _{k=1}^{N}x_{k}=5509.651\), \(Y=\sum _{k=1}^{N}y_{k}=5531.833\), and the difference \(D=X-Y=-22.182\) (which means that the hospitals of our sample billed slightly less than they could have!). The totals D and J usually have to be estimated in practice. However, for this population, we were able to compare their true and estimated values.
After computing the inclusion probabilities \(\hat{\pi }_{k}\) according to Result 2, we drew 10,000 samples of size \(n=100\). For each sample, we computed the estimates \(\hat{D}\) and \({\hat{J}}\). The samples were drawn according to the following strategies:

cSRS: Simple random sampling without replacement, calibrated on x,

OPTD: Optimal design for D, balanced and calibrated on x, \(x\hat{\psi }\), and \(x\hat{\psi } a\),

OPTJ: Optimal design for J, balanced on \(\psi\), calibrated on x and \(\hat{\psi }\),

MUS: naturally balanced on x,

iMUS: improved MUS balanced on \(x\hat{\psi }\) and \(x\hat{\psi } a\), calibrated on x, \(x\hat{\psi }\) and \(x\hat{\psi } a.\)
OPTD and OPTJ were based on Results 2 and 3. Table 5 reports the empirical mean squared errors (mse) of \(\hat{D}\) and \(\hat{J}\). We note that OPTD was the best design for the estimation of D and OPTJ the best design for the estimation of J.
For comparison, the mse of the non-calibrated \({\hat{D}}_1\) and \({\hat{D}}_2\) under simple random sampling (SRS) were 40.53 and 1.15, respectively. Note that 40.53 is also the mse of \({\hat{Y}}\) under SRS. The strong positive correlation (0.986) between \({\hat{X}}\) and \({\hat{Y}}\) explains the huge reduction in mse from \({\hat{D}}_1\) to \({\hat{D}}_2\): the variance of the difference \({\hat{X}}-{\hat{Y}}\) shrinks as the covariance of its two terms grows.
A note on sample size
The usual approach to the determination of the sample size n starts with the choice of a level of accuracy. Then, the minimal sample size that guarantees this level is determined according to some mathematical procedure (Stringer 1963; Neter and Loebbecke 1977; Leitch et al. 1981; Bickel 1992). The accuracy of an estimator depends on the distribution of the estimator and on the sampling design. The simplest design is SRS, for which a simple formula to compute the sample size can be found in standard textbooks. Suppose, for instance, that \(n_1\) is the sample size required by \({\hat{D}}_2\) under SRS (based on a preliminary estimate of the population variance of \(x_k-y_k\)). For a more complex design, the use of the “design effect” is often recommended (see for example, Survey Methods and Practices of Statistics Canada 2010, Chapter 8). The design effect is defined as the ratio between the MSE under the complex design and the MSE under SRS. For instance, for OPTD, using the results of the previous section, we obtain a design effect \(\mathrm{deff}\) of about 0.37.
Then, the sample size required by OPTD to attain the same level of accuracy as SRS is \(n_2=n_1\times \mathrm{deff}\). This means that, in our example, OPTD requires about \(37\,\%\) of the units sampled under SRS. A similar computation shows that OPTD requires about \(66\,\%\) of the units sampled under MUS.
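The design-effect adjustment amounts to one multiplication, sketched below in Python. The function name and the SRS sample size of 270 are hypothetical; the design effect of 0.37 is the rounded value implied by the \(37\,\%\) figure above. The sketch assumes, as the rule itself does, that the MSE of each strategy is roughly inversely proportional to the sample size.

```python
import math

def sample_size_from_deff(n_srs, deff):
    """Sample size needed by the complex design to match SRS accuracy,
    computed as n_srs * deff and rounded up to a whole unit."""
    if n_srs <= 0 or deff <= 0:
        raise ValueError("n_srs and deff must be positive")
    return math.ceil(n_srs * deff)

n1 = 270                 # hypothetical sample size required under SRS
deff_optd = 0.37         # approximate design effect of OPTD relative to SRS
n2 = sample_size_from_deff(n1, deff_optd)
print(n1, n2)
```

Rounding up is a conservative choice made for this sketch, so that the target accuracy is not missed because of truncation.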
A note on inference
Statistical inference based on estimators produced by balanced sampling can be obtained by means of the “residual technique” proposed by Deville and Tillé (2005). For the estimator \({\hat{D}}_2\), this technique provides the following result. Suppose that \({\mathbf{u}}\in {\mathbb {R}}^r\) is the vector of balancing variables. Let \(c_k=n(1-\pi _k)/(n-r)\) and
Then,
We used this formula to compute Gaussian-based confidence intervals for D in a simulation experiment based on the Swiss hospital data. For three sample sizes (100, 200, and 400), we drew 10,000 samples according to the strategies cSRS, OPTD, MUS, and iMUS defined above. For each sample, we computed two confidence intervals with nominal coverages 90 and \(95\,\%\). The empirical coverages (proportions of intervals including D) are reported in Table 6. We observe that, with the exception of the uncalibrated MUS, the empirical coverages are generally somewhat lower than the nominal levels. However, they converge to the nominal values for increasing n and are quite satisfactory for most practical purposes. Bootstrap alternatives are also available (see e.g., Bickel 1992). (The surprising behavior of the uncalibrated MUS can be explained as follows: Pea et al. (2007) have shown that the number of samples with non-null selection probability is not larger than the population size, which is not very large. Therefore, the sampling distribution of the estimators is rather degenerate and not normal.)
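Given a point estimate of D and an estimate of its variance, the Gaussian interval and the empirical coverage used in this experiment can be sketched as follows. The function names and the numerical inputs are hypothetical, and the variance estimate is simply taken as given; the sketch does not implement the residual technique itself.

```python
from statistics import NormalDist

def gaussian_ci(estimate, variance_estimate, level=0.95):
    """Two-sided Gaussian confidence interval: estimate +/- z * sqrt(variance)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * variance_estimate ** 0.5
    return estimate - half_width, estimate + half_width

def empirical_coverage(intervals, true_value):
    """Proportion of intervals (low, high) that contain the true value."""
    hits = sum(1 for low, high in intervals if low <= true_value <= high)
    return hits / len(intervals)

# Hypothetical estimate of D and of its variance for a single sample.
lo90, hi90 = gaussian_ci(-20.0, 0.5, level=0.90)
lo95, hi95 = gaussian_ci(-20.0, 0.5, level=0.95)
print((round(lo90, 2), round(hi90, 2)), (round(lo95, 2), round(hi95, 2)))
```

In a simulation, `empirical_coverage` would be applied to the list of intervals obtained from the repeated samples, with the known population value of D as `true_value`.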
Other potential applications
In the previous sections, we described the details of the practical implementation of the optimal sampling design in the case of hospital bill audits; however, other audit procedures share similar features and can benefit from the optimal design. For example, in accounts receivable (AR) auditing, the covariables related to the risk of misstated transactions are those characterizing the complexity of sales orders, the customer creditworthiness, the billing and shipping procedures, and the completeness of the AR general ledger. In many cases, past information about these covariables is available and, thus, statistical modeling of the risk is possible. The main goal is to obtain an estimate of the difference D between the original and the audit grand total of AR in the ledger. A specific example is the internal control of tax assessment of a governmental revenue service. This kind of audit can be repeated each year under similar conditions with the purpose of providing a reliable audit forecast of the total tax revenue (i.e., the total AR) to the Treasury. Since hundreds of thousands of assessments are generated and invoiced each year, an exhaustive control is out of the question. Different types of information associated with the risk of an incorrect assessment in tax controls can be found in the Manual Audit Sampling available on the web site of the Multistate Tax Commission (http://www.mtc.gov). For example, the assessment of the tax declaration of a modest pensioner is much simpler than that of a wealthy entrepreneur or of a business company, and the associated risks of misstatement and error amounts are clearly not alike. However, an extensive body of information about past assessments is available to the revenue service, allowing an estimation of both the probability of misstatement as a function of the available covariables and the relationship between the error amount and the covariables.
Therefore, an optimal sampling design that focuses directly on the most problematic cases can be implemented. Notice, however, that the selection still remains random and thus continues to cover and represent the entire population. Therefore, different aspects of the assessment quality (i.e., different types of errors, other than D, related to specific items of the assessment) can be investigated with the help of the observed sample.
Discussion
We have shown the importance of modeling the error generating process in the design of audit sampling. When preliminary data providing auxiliary information are available, it is possible to model the probability of error occurrence and the error magnitude with the help of auxiliary covariables. In this case, optimal strategies of audit sampling can be determined, which minimize the mean squared error of the estimated amount of the misstatement. With the help of real data from an audit experience of the Swiss hospital billing system we have shown that these strategies require a smaller sample size with respect to other sampling strategies (including MUS and stratification) in order to attain the same level of accuracy. To our knowledge, this apparently expected result has never been published in the auditing literature.
We introduced a very flexible two-stage model of the error generating process. The first-stage logistic model M0 is the most popular model relating the probability of a Bernoulli variable, such as the random occurrence of an error in the written value, to a set of auxiliary covariables. The second-stage model M1 asserts that, when present, the error \(y_k-x_k\) is proportional to \(x_k\), which is a typical feature of many audit populations. Note that M1 may include an intercept term (set \(z_{1k}=1/x_k\) for all k) and that it assumes nothing more than \(\mathrm{E}_M(\varepsilon _k)=0\) and \(\mathrm{Var}_M(\varepsilon _k)=\sigma ^2\) about the shape of the error distribution. Therefore, the combined model M0–M1 should be useful for a very large class of audit settings, including those, such as accounts receivable audits, where only overstatements \(x_k-y_k >0\) with a skewed distribution are possible. In addition, we will show below that MUS is the optimal design under the simplest version of M0–M1, when no auxiliary covariable is available. On some occasions, as in our example with hospital data, exploratory data analysis from previous audits may suggest that a log-log transformation improves the fit and that M2 is more adequate than M1. Note, however, that M2 assumes a normal error and that this assumption must be checked. The models M0–M1 or M0–M2 can be estimated using standard statistical software.
All optimal strategies overselect transactions that have at the same time a large probability of being wrong and a considerable amount of error. On the one hand, these strategies allow a very accurate estimation of the total amount of misstatements and, therefore, an optimal sample size; on the other hand, they require checking a large number of “heavy transactions”, which may increase the burden of work in examining these cases.
With the help of the real data set, we have provided a detailed guideline for the implementation of the optimal strategies in practice. The model estimation is based on previous audit experiences for a population which is similar to the one about which the auditor wishes to draw conclusions; the larger the experience, the better the model adequacy will be. However, the optimal sampling designs are balanced and use unequal inclusion probabilities. In addition, calibration is often necessary in order to fully exploit the optimality results. Therefore, sophisticated procedures are necessary to combine these requirements and implement these designs. Today, specialized software is available which simplifies the use of these techniques. In our examples, we have used the R package Sampling described in Tillé and Matei (2012). A SAS macro has also been developed by Chauvet and Tillé (2005).
We have also shown that the optimal strategies are usually more efficient than MUS. However, MUS remains the technique of choice when the error magnitude increases with the amount of the transaction but it is not possible to use auxiliary information to estimate the two-stage models. In this case, the equations of models M0 and M1 become:
and
where c is a constant. Therefore, \(\psi\) does not depend on k. Moreover, \(\mathrm{E}_M(j_k)=\psi\) and \(\mathrm{Var}_M(j_k)=\psi (1-\psi ).\) The variables \(\varepsilon _k\) are assumed to be independent and such that \(\mathrm{E}_M(\varepsilon _k)=0\), and \(\mathrm{Var}_M(\varepsilon _k)=\sigma ^2.\) The variables \(j_k\) and \(\varepsilon _k\) are also assumed to be independent. With this simplified model, Result 1 is modified as follows.
Under model (10), the sampling design that minimizes the anticipated \(\mathrm{MSE}\) of \(\hat{D}_1\) and \(\hat{D}_2\) must be balanced on the variable x, and the inclusion probabilities \(\pi _k\) are proportional to \(x_k\).
This optimal solution coincides with MUS.
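For readers who want to see the mechanics, MUS in its classical systematic form can be sketched in Python as follows. The function name `systematic_pps_sample` and the toy amounts are hypothetical, and the sketch assumes that no transaction exceeds the sampling interval \(X/n\), so that each inclusion probability \(n x_k/X\) is at most 1.

```python
import random

def systematic_pps_sample(x, n, rng=random):
    """Systematic sampling with inclusion probabilities n * x_k / X.
    Assumes every x_k is at most X / n; a larger unit could be hit twice."""
    total = sum(x)
    step = total / n                       # sampling interval in monetary units
    start = rng.random() * step            # random start in [0, step)
    points = [start + i * step for i in range(n)]
    sample, cumulative, i = [], 0.0, 0
    for k, xk in enumerate(x):
        cumulative += xk                   # upper end of the range covered by unit k
        while i < n and points[i] < cumulative:
            sample.append(k)               # selection point falls inside unit k
            i += 1
    return sample

# Toy ledger of written amounts (hypothetical values).
x = [10.0, 20.0, 30.0, 40.0]
sample = systematic_pps_sample(x, n=2, rng=random.Random(0))
print(sample)
```

Each selection point picks the transaction whose cumulated amount range contains it, so larger transactions are proportionally more likely to be drawn.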
Finally, the reader may note that we provided a technique to compute approximate confidence bounds for D. This technique worked well in our example with real data because the error distribution was not very skewed. In accounts receivable populations, the usual goal is to set an upper confidence bound for D. The usual estimator of D is \(X n^{-1} \sum _{k\in S} t_k\), where \(t_k=(x_k-y_k)/x_k\) is the taint of transaction k. This estimator coincides with \({\hat{D}}_2\); however, it is usually assumed that \(0\le t_k\le 1\), i.e., only overstatements are possible (with the book amount as the maximum error), and the taint distribution is very skewed. When MUS is employed, the Stringer bound (Stringer 1963) is the most popular upper confidence bound for D, but some alternative bounds are available (see e.g., Bickel 1992; Higgins and Nandram 2009). Further research is necessary to obtain a Stringer-type bound when the optimal design is used. At present, only the bootstrap alternatives are available for this purpose.
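As a brief check of the statement that the taint estimator coincides with \({\hat{D}}_2\) under MUS, assume (this is our reading of the notation, not a definition quoted from the text) that \({\hat{D}}_2\) is the Horvitz–Thompson estimator of the total of the differences \(x_k-y_k\) and that MUS uses a fixed sample size n with \(\pi _k=n x_k/X\). Then
$$\hat{D}_2=\sum _{k\in S}\frac{x_k-y_k}{\pi _k}=\sum _{k\in S}\frac{(x_k-y_k)\,X}{n\,x_k}=X n^{-1}\sum _{k\in S}t_k.$$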
References
AAMAS (2009) National health care billing audit guidelines. Technical report, The American Association of Medical Audit Specialists
AICPA (2008) Audit guide: audit sampling. Technical report. American Institute of Certified Public Accountants
Bickel PJ (1992) Inference and auditing: the Stringer bound. Int Stat Rev 60(2):197–209
Chauvet G, Tillé Y (2005) Fast SAS macros for balancing samples: user’s guide. Software Manual, University of Neuchâtel. http://www2.unine.ch/statistics/page10890.html
Deville JC, Särndal CE (1992) Calibration estimators in survey sampling. J Am Stat Assoc 87:376–382
Deville JC, Tillé Y (2004) Efficient balanced sampling: the cube method. Biometrika 91:893–912
Deville JC, Tillé Y (2005) Variance approximation under balanced sampling. J Stat Plann Inference 128:569–591
Dickhaut JW, Eggleton IRC (1975) An examination of the processes underlying comparative judgements of numerical stimuli. J Account Res 13:38–72
Fetter RB, Shin Y, Freeman JL, Averill RF, Thompson JD (1980) Case mix definition by diagnosis-related groups. Med Care 18:1–53
Gafford WW, Carmichael DR (1984) Materiality, audit risk and sampling: a nuts-and-bolts approach (part one). J Account 158(4):109–110
Gemayel NM, Stasny EA, Tackett JA, Wolfe DA (2011) Ranked set sampling: an auditing application. Rev Quant Finance Account 39:413–422
Grimlund RA, Felix D (1987) Simulation evidence and analysis of alternative methods of evaluating dollar-unit samples. Contemp Account Res 62(3):455–480
Hansen SC (1993) Strategic sampling, physical units sampling, and dollar units sampling. Account Rev 68(2):232–345
Higgins HN, Nandram B (2009) Monetary unit sampling: improving estimation of the total audit error. Adv Account 25(2):174–182
Hoogduin LA, Hall TW, Tsay JJ, Pierce BJ (2015) Does systematic selection lead to unreliable risk assessments in monetary-unit sampling applications? Audit J Pract Theory 34(4):85–107
Horvitz DG, Thompson DJ (1952) A generalization of sampling without replacement from a finite universe. J Am Stat Assoc 47:663–685
Hosmer DW, Lemeshow S (2013) Applied logistic regression. Wiley, New York
Kaplan RS (1973) Statistical sampling in auditing with auxiliary information estimators. J Account Res 11(2):238–258
Leitch RA, Neter J, Plante R, Sinha P (1981) Implementation of upper multinomial bound using clustering. J Am Stat Assoc 76(375):530–533
Leslie DA, Teitlebaum AD, Anderson RJ (1980) Dollar-unit sampling: a practical guide for auditors. Pitman, London
Madow WG (1949) On the theory of systematic sampling, II. Ann Math Stat 20:333–354
Nedyalkova D, Tillé Y (2008) Optimal sampling and estimation strategies under linear model. Biometrika 95:521–537
Neter J, Loebbecke JK (1977) On the behavior of statistical estimators when sampling accounting populations. J Am Stat Assoc 72(359):501–507
Pea J, Qualité L, Tillé Y (2007) Systematic sampling is a minimal support design. Comput Stat Data Anal 51:5591–5602
Rosenberg MA, Fryback DG, Katz DA (2000) A statistical model to detect DRG upcoding. Health Serv Outcomes Res Methodol 1(3–4):233–252
SFAO (2014) Kontrolle von DRG-Spitalrechnungen durch die Krankenversicherungen. Technical Report EFK14367, Swiss Federal Audit Office
Smieliauskas W (1986) Control of sampling risks in auditing. Contemp Account Res 3(1):102–124
Statistics Canada (2010) Survey methods and practices, catalogue no. 12-587-X. Technical report. Statistics Canada
Stringer KW (1963) Practical aspects of statistical sampling in auditing. In: ASA proceedings of the business and economic statistics section. American Statistical Association, pp 405–411
Tillé Y (2006) Sampling algorithms. Springer, New York
Tillé Y, Matei A (2012) Sampling: survey sampling. R package version 2.5
Tsui KW, Matsumura EM, Tsui KL (1985) Multinomial-Dirichlet bounds for dollar-unit sampling in auditing. Account Rev 60(1):76–97
Acknowledgments
We are grateful to Sandro Prosperi and Giordano Macchi for their helpful suggestions.
Appendix
Proof of equations (3)–(6)
We have
and since \(\varepsilon _k\) has a normal distribution,
Therefore,
and
It follows that
Using \(a_k=\exp ({\mathbf{z}}_k^{\rm T}\beta +\sigma ^2/2)\), we have
Finally,
Lemma 1
Under model M1, we have
Proof
We use the following general result. If
with \(\mathrm{E}_M(u_k)=0\), \(\mathrm{Var}_M(u_k)=\sigma _{uk}^2\), \(\mathrm{Cov}_M(u_k,u_\ell )=0\), then
A proof is available, among others, in Nedyalkova and Tillé (2008). Now, under model M1, we have
Since \(\mathrm{E}_M[x_k (j_k-\psi _k)( {\mathbf{z}}_k^{\rm T} \beta + \varepsilon _k)] =0\),
We now put \(u_k = x_k[ (j_k-\psi _k) {\mathbf{z}}_k^{\rm T} \beta + j_k \varepsilon _k]\), and apply (11) to (12). We obtain
Lemma 2
Under model M1, we have
Proof
We have
We use (13) again, and apply (11) to (15). We obtain
Proof of Result 1
In order to minimize (13), we can first select a sample that is balanced on \(x_k\) and \(x_k\psi _k {\mathbf{z}}_k\), i.e. a sample such that
In this case, the first term of (13) vanishes. Now, if we minimize the second term with respect to \(\pi _k\)
subject to
we find, using the Lagrange technique,
Lemma 3
Under model M2, we have
where \(a_k\) and \(v_k\) are defined in (5) and (6).
Lemma 4
Under model M2, we have
Proofs of Lemma 3, Lemma 4, and Result 2
The proofs are the same as the proofs of Lemmas 1 and 2, and Result 1, if we notice that M2 can be written as a linear heteroscedastic model:
where \(r_k\) is defined in (4), \(a_k\) is defined in (5), and \(v_k\) is defined in (6).
Lemma 5
Proof
We only have to note that
Since \(\mathrm{E}_M(j_k-\psi _k)=0\) and \(\mathrm{Var}_M(j_k-\psi _k)=\psi _k(1-\psi _k)\), we can proceed as in the proof of Result 1. \(\square\)
Proof of Result 3
The proof is the same as for Result 1.
A Monte Carlo simulation
A population of size \(N=1000\) was generated according to models M0 and M1 as follows:

the \(x_k\) are independent lognormal variables with parameters \(\mu =0.7\) and \(\sigma =0.4\),

\({\mathbf{v}}_k=(1,x_k,b_k)^{\rm T}\), \(\gamma _k=(1,0.007,1.0)^{\rm T}\),

the \(b_k\) are independent Bernoulli variables such that \(\mathrm{Pr}(b_k=1)=0.8\),

\({\mathbf{z}}_k = (1,z_k)^{\rm T}\), \(\beta =(0,0.5 )^{\rm T}\),

the \(z_k\) are independent normal variables such that \(z_k\sim N(0.2,0.3^2),\)

the \(\varepsilon _k\) are independent normal variables such that \(\varepsilon _k\sim N(0,0.1^2)\).
Figure 3 shows the generated population as well as \(\psi _k =\exp ({\mathbf{v}}_k^{\rm T} \gamma )/(1+\exp ({\mathbf{v}}_k^{\rm T} \gamma ))\) as a function of \(x_k\) and \(b_k\). There are \(J=292\) incorrect transactions and \(D=2313.277\). There are more errors on small transactions than on large transactions. The variables \(b_k\) define two groups with different error levels. The amount of errors depends on \(z_k.\) We compared the following strategies:

cSRS: Simple random sampling without replacement, calibrated on x,

OPTD: Optimal design for D, balanced and calibrated on x and \(x\psi {\mathbf{z}}\),

OPTJ: Optimal design for J, balanced on \(\psi\), calibrated on x and \(\psi\),

MUS: naturally balanced on x,

iMUS: improved MUS balanced on \(x\psi\) and \(x\psi {\mathbf{z}}\), calibrated on x, \(x\psi\) and \(x\psi {\mathbf{z}}.\)
OPTD and OPTJ were based on Results 1 and 3. Calibration was computed according to the raking ratio technique (Deville and Särndal 1992). We selected 10,000 samples of size 100. Table 7 reports the empirical \(\mathrm{MSE}\) (mse) of \({\hat{D}}={\hat{D}}_1={\hat{D}}_2\) and \({\hat{J}}\). The simulation confirms the theory: OPTD is the best strategy to estimate D and OPTJ is the best strategy to estimate J. MUS is a usable tool for estimating D but can be improved by balancing and calibrating.
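The empirical mse reported in such a simulation is the average squared deviation of the estimates from the known population value. The Python sketch below illustrates the computation for the simplest strategy only, simple random sampling without replacement and without calibration; the function names and the toy data are hypothetical, and the balanced and calibrated strategies compared above are not implemented here.

```python
import random

def empirical_mse(estimates, truth):
    """Average squared deviation of the estimates from the true value."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)

def simulate_srs_mse(x, y, n, reps, seed=0):
    """Empirical mse of the expansion estimator of D = sum(x_k - y_k)
    under simple random sampling without replacement."""
    rng = random.Random(seed)
    N = len(x)
    d = [xk - yk for xk, yk in zip(x, y)]
    D = sum(d)
    estimates = []
    for _ in range(reps):
        s = rng.sample(range(N), n)                      # one SRS of size n
        estimates.append(N / n * sum(d[k] for k in s))   # expansion estimate of D
    return empirical_mse(estimates, D)

# Toy population: written amounts x and audit amounts y (hypothetical values).
x = [100, 250, 80, 400, 120, 300]
y = [100, 240, 80, 400, 150, 300]
print(simulate_srs_mse(x, y, n=3, reps=2000))
```

Replacing the sampling and estimation lines inside the loop by another strategy gives the corresponding column of a comparison table.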
Marazzi, A., Tillé, Y. Using past experience to optimize audit sampling design. Rev Quant Finan Acc 49, 435–462 (2017). https://doi.org/10.1007/s11156-016-0596-7
Keywords
 Audit
 Hospital bill audit
 Monetary unit sampling
 Dollar unit sampling
 Balanced sampling
 Horvitz–Thompson Estimator
JEL Classification
 C61
 C83
 M41
 M42
 H83