Reference Work Entry

Encyclopedia of Biometrics

pp 1328-1332

Test Sample and Size

  • Michael E. Schuckers, St. Lawrence University

Synonyms

Crew designs; Sample size; Target population

Definition

The testing and evaluation of biometrics is a complex task. Among the difficulties in such an endeavor are selecting the number and type of individuals who will participate in the test, determining the amount of data to be collected, and choosing an appropriate set of individuals from whom to collect the biometric data.

Introduction

The assessment of a biometric system’s matching performance is an important part of evaluating such a system. A biometric implementation is an ongoing process and as such will be treated as a process in the sense of Hahn and Meeker [1]. Thus, any inference regarding that process will be analytic in nature rather than enumerative as delineated by Deming [2]:

An enumerative study has for its aim an estimate of the number of units of a frame that belong to a specified class. An analytic study has for its aim a basis for action on the cause-system or the process, to improve product of the future.

Here the focus is on determining the amount and type of data necessary for assessing the current matching performance of a biometrics system.

The matching performance measures that are commonly considered most important are the false match rate (FMR) and the false non-match rate (FNMR). One of the important parts of designing a test of a biometrics system is to determine, prior to completion, the amount of testing that will be done. Calculations that explicitly determine the amount of biometric data to be sampled are described below. As with any calculations of this kind, it is necessary to make some estimates about the nature of the process beforehand. Without these, it is not possible to determine the amount of data to collect. These sample size calculations are derived to achieve a certain level of sampling variability. It is important to recognize that there are other potential sources of variability in any data collection process.

Selection of the individuals from whom these images will be taken is another difficulty because of the need to ensure that the biometric samples taken are representative of the matching and decision making process. The goal of any data collection should be to take a sample that is as representative as possible of the process about which inference will be made. Ideally, some probabilistic mechanism should be utilized to select individuals from a targeted population. In reality, because of limitations of time and cost, this is a difficult undertaking and often results in a convenience sample, Hahn and Meeker [1].

Test Size Calculation

Determining the amount of biometric information to collect is an ongoing concern for the evaluation of a biometrics system. Several early attempts to address this problem include those by Wayman [3] and [4] as well as the description in Mansfield and Wayman [5] of the “Rule of 3” and the “Rule of 30”. The former is due to several authors including Louis [6] as well as Jovanovic and Levy [7], while the latter, the so-called Doddington’s Rule, is due to Doddington et al. [8]. Mansfield and Wayman note that neither of these approaches is satisfactory since they assume that error rates are due to a “single source of variability”, which is not generally the case with biometrics. Ten enrolment-test sample pairs from each of a hundred people is not statistically equivalent to a single enrolment-test sample pair from each of a thousand people, and will not deliver the same level of certainty in the results.

Effectively, the use of either the “Rule of 3” or the “Rule of 30” requires the assumption that the decisions used to estimate error rates are uncorrelated. More recently, Schuckers [9] provided a method for dealing with the issue of the dual sources of variability and the resulting correlations that arise from this structure.
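For reference, the sketch below (not part of the original entry; all function names are illustrative) states the two rules of thumb in their usual independent-decision form: with zero observed errors in N independent decisions, the Rule of 3 gives an approximate 95% upper confidence bound of 3∕N on the error rate, while the Rule of 30 calls for roughly 30 observed errors so that the true rate lies within about ±30% of the observed rate with 90% confidence. The point is only to make the single-source-of-variability assumption concrete.

```python
# Minimal sketch of the "Rule of 3" and "Rule of 30" under the assumption of
# independent decisions; names and example numbers are illustrative only.
import math

def rule_of_three_upper_bound(n_decisions):
    """Approximate 95% upper bound on the error rate when 0 errors are observed."""
    return 3.0 / n_decisions

def rule_of_thirty_rel_margin(n_errors, z=1.645):
    """Approximate two-sided 90% relative margin of error for a small error rate,
    given the number of observed errors (roughly z / sqrt(errors))."""
    return z / math.sqrt(n_errors)

print(rule_of_three_upper_bound(300))            # 0.01 with zero errors in 300 trials
print(round(rule_of_thirty_rel_margin(30), 3))   # ~0.30, i.e. about +/-30% with 30 errors
```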

The calculation given below is for determining the number of comparison pairs, n, from which samples need to be taken. Define a comparison pair, similar to the enrolment-test sample pair of Mansfield and Wayman [5], as a pair of possibly identical individuals from whom biometric data or images have been taken and compared. If the two individuals are the same, call the comparison pair a genuine one; if the two individuals are distinct, call the comparison pair an imposter one. In order to use this framework to determine test size, it is necessary to specify estimates of the process parameters before the data collection is complete. It is worth noting that most other biological and medical disciplines use such calculations on a regular basis and that the U.S. Food and Drug Administration requires them for clinical trials. Approaches to carrying this out are discussed below.

Let the error rate of interest, either FMR or FNMR, for a process be represented by γ and let \(Y_{ij}\) represent the decision for the jth pair of captures collected on the ith comparison pair, where n is the number of comparison pairs, i = 1, …, n and j = 1, …, \(m_i\). Thus, the number of decisions that are made for the ith comparison pair is \(m_i\), and n is the number of different comparison pairs being compared. Define
$${Y}_{ij} = \begin{cases} 1 & \text{if the } j\text{th decision from comparison pair } i \text{ is incorrect}\\ 0 & \text{otherwise.}\end{cases}$$
(1)
Assume for the \(Y_{ij}\)'s that \(E[Y_{ij}] = \gamma\) and \(V[Y_{ij}] = \gamma (1 - \gamma)\), where E[X] and V[X] represent the mean and variance of X, respectively. Estimation of γ is done separately for FNMR and FMR, and so there is a separate collection of \(Y_{ij}\)'s for each. The form of the variance is a result of each decision being binary. The correlation structure for the \(Y_{ij}\)'s is
$$Corr({Y}_{ij},{Y}_{i'j'}) = \begin{cases} 1 & \text{if } i = i',\ j = j'\\ \rho & \text{if } i = i',\ j \neq j'\\ 0 & \text{otherwise.}\end{cases}$$
(2)

This correlation structure is based upon the idea that there will only be correlations between decisions made on the same comparison pair, not between decisions made on different comparison pairs. Thus, conditional upon the error rate, there is no correlation between decisions on the ith comparison pair and decisions on the i′th comparison pair when i ≠ i′. The degree of correlation is summarized by ρ. This is not the typical Pearson’s correlation coefficient; rather, it is the intra-class correlation, or here the intra-comparison pair correlation. More details can be found in Schuckers [10].
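For intuition, the following simulation sketch (not from the original entry; the function and parameter names are illustrative) generates decisions with exactly this structure: each comparison pair receives its own latent error probability, which induces an intra-pair correlation ρ, while decisions from different pairs remain independent. A beta-binomial construction is one convenient way to do this.

```python
# Simulation sketch of the correlation structure in (2): a beta-binomial model
# with mean gamma and intra-class correlation rho = 1/(a + b + 1).
import numpy as np

def simulate_decisions(n_pairs, m, gamma, rho, seed=0):
    """Simulate an n_pairs x m array of binary decisions Y_ij."""
    rng = np.random.default_rng(seed)
    s = 1.0 / rho - 1.0                       # a + b for the Beta distribution
    a, b = gamma * s, (1.0 - gamma) * s
    p = rng.beta(a, b, size=n_pairs)          # latent error rate for each comparison pair
    return rng.binomial(1, p[:, None], size=(n_pairs, m))

Y = simulate_decisions(n_pairs=1000, m=10, gamma=0.01, rho=0.4)
print(Y.mean())   # overall error rate; close to gamma on average
```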

Derivation of sample size calculations requires an understanding of sampling variability in the estimated error rate. Thus consider
$$\hat{V }[\hat{\gamma }] = {N}^{-2}\hat{\gamma }(1 -\hat{ \gamma })\left[N +\hat{ \rho }{\sum \limits_{i=1}^{n}{m}_{ i}({m}_{i} - 1)}\right],$$
(3)
where \(N =\sum\nolimits_{i=1}^{n}{m}_{i}\), and \(\hat{\gamma } = {N}^{-1}{ \, {\sum \nolimits _{i=1}^{n}}}\, {{\sum \nolimits _{j=1}^{{m}_{i}}}}{Y }_{ij}\). Fleiss et al. [11] suggested the following moment-based estimator for ρ:
$$\eqalign{ {\hat{\rho}} = &{\Bigg({\hat{\gamma} }(1 -{\hat{ \gamma }}){\sum \limits_{i=1}^{n}{m}_{i}({m}_{i} - 1)} \Bigg)^{-1}}\cr &\sum \limits_{i=1}^{n} \sum\limits_{j=1}^{{m}_{i} }{\sum \limits_{{ j'=1 \atop j'\mathrel{\not =}j} }^{{m}_{i}}}({Y}_{ij}-\hat{\gamma })({Y}_{ij'}-\hat{\gamma }).}$$
(4)
Since \(\hat{\gamma }\) is a linear combination, if n is large it is reasonable to assume that the central limit theorem holds, Serfling [12]. To produce a (1 − α) × 100% confidence interval for γ use
$$\hat{\gamma } \pm {z}_{\alpha /2}\sqrt{{N}^{-2 } \hat{\gamma }(1-\hat{ \gamma })\bigg[N +\hat{ \rho }{\sum\limits_{i=1}^{n}{m}_{i}({m}_{i} - 1)}\bigg]},$$
(5)
where \(z_{\alpha /2}\) represents the \(100(1 - \alpha /2)\)th percentile of a Gaussian distribution with mean 0 and variance 1. Further, if \(m_i = m\) for all i, (3) simplifies to
$$V [\hat{\gamma }] = {(nm)}^{-1}\hat{\gamma } (1 -\hat{\gamma })\left[1 + \rho (m - 1)\right],$$
(6)
where N has been replaced by nm. This form will be used to derive sample size calculations.
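The quantities in (3)–(5) can be computed directly from an array of decisions. The sketch below, assuming an equal number of decisions per comparison pair (\(m_i = m\)) and using illustrative function names, computes \(\hat{\gamma}\), the moment estimator \(\hat{\rho}\) of (4), and the confidence interval (5).

```python
# Sketch of the estimated error rate, the Fleiss-type estimator of rho in (4),
# and the confidence interval (5), for an n x m array Y of binary decisions.
import numpy as np
from scipy.stats import norm

def error_rate_ci(Y, alpha=0.05):
    n, m = Y.shape
    N = n * m
    gamma_hat = Y.mean()                                       # N^{-1} * sum_ij Y_ij
    dev = Y - gamma_hat
    # sum over pairs of sum_{j != j'} (Y_ij - gamma_hat)(Y_ij' - gamma_hat)
    cross = (dev.sum(axis=1) ** 2 - (dev ** 2).sum(axis=1)).sum()
    rho_hat = cross / (gamma_hat * (1 - gamma_hat) * n * m * (m - 1))    # Eq. (4)
    var_hat = gamma_hat * (1 - gamma_hat) / N ** 2 * (
        N + rho_hat * n * m * (m - 1))                                   # Eq. (3)
    half = norm.ppf(1 - alpha / 2) * np.sqrt(var_hat)
    return gamma_hat, rho_hat, (gamma_hat - half, gamma_hat + half)

# e.g., with the simulated decisions Y from the sketch above, or any 0/1 array:
Y = np.random.default_rng(1).binomial(1, 0.01, size=(1000, 10))
print(error_rate_ci(Y))
```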
Turning from variance estimation to sample size calculations, set the margin of error, the term following the ± in (5) with the variance given by the simplified form (6), equal to some desired value B and solve for n, the number of comparison pairs. The following sample size calculation for a 100(1 − α)% CI with a specified margin of error B is then obtained:
$$n = \left \lceil {{z}_{1-{\alpha \over 2} }^{2}\gamma (1 - \gamma )(1 + (m - 1)\rho )\over m{B}^{2}} \right \rceil ,$$
(7)
where ⌈ ⌉ is the ceiling function, which rounds up to the next largest integer. In order to create sample size calculations for a confidence interval, it is necessary to specify, among other things, the desired margin of error, B, for the interval. As mentioned above, there are effectively two sample sizes when dealing with performance evaluation for biometric authentication devices. The derivation here is for the number of comparison pairs, n, that need to be tested, and it assumes that the number of decisions per comparison pair is fixed and known. This is equivalent to assuming that \(m_i = m\) for all i and that m is known. In practice it is possible to obtain different values for n by varying m before proceeding with an evaluation.

As with all sample size calculations, it is important to note that a priori values for the parameters in the model must be specified. In this case that means estimating values for γ and ρ in order to determine the number of comparison pairs, n. Several strategies for these a priori specifications are reasonable and have been discussed in the statistics literature; see, e.g., Lohr [13]. Ideally, it would be possible to run a pilot study of the process under consideration and use actual process data to estimate these quantities. Alternatively, it may be possible to use estimates from other studies, perhaps done under similar circumstances or with similar devices. The last possibility is to approximate based upon prior knowledge without data. Regardless of the method used, it is important to recognize that n is a function of α, B, m, γ and ρ: n varies directly with γ and ρ, and inversely with α, m and B. Thus, a conservative approach to estimating these quantities would overestimate γ and ρ and underestimate m, producing a value for n that is likely to be larger than required. Table 1 illustrates the use of (7), and a short computational sketch of (7) is given below. It is also worth noting that most studies of this type have a significant drop-out rate among individuals as the data collection progresses, so it is advisable to plan the collection process assuming some attrition in the number of comparison pairs selected. The values of α and B are likely to be set by investigators or by standards bodies rather than by the performance of the process under study.
Test Sample and Size. Table 1

Illustration of the use of (7)

α      B       γ      m    ρ     n
0.05   0.005   0.01   10   0.4   700
0.05   0.01    0.01   10   0.4   175
0.01   0.005   0.01   10   0.4   1,209
0.05   0.005   0.02   10   0.4   1,386
0.05   0.005   0.01   5    0.4   792
0.05   0.005   0.01   10   0.1   290

Equation (7) is straightforward for calculating the number of comparison pairs that need to be tested when γ = FNMR. It is less so when interest centers on γ = FMR. This is because for FNMR the number of comparison pairs translates directly to the number of individuals, while for FMR the number of comparison pairs is not proportional to the number of individuals. If all cross-comparisons are used to estimate FMR, then one can replace n in (7) with n*(n* − 1). In that case n* will be the number of individuals that need to be tested.
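A computational sketch of (7) follows (function names are illustrative). The first two calls reproduce rows of Table 1, and the second function simply inverts the relation n = n*(n* − 1) described above to recover the number of individuals n* needed when all cross-comparisons are used to estimate FMR.

```python
# Sketch of the sample size calculation (7) and of solving n*(n* - 1) >= n
# for the number of individuals n* in the FMR case.
import math
from scipy.stats import norm

def n_comparison_pairs(gamma, rho, m, B, alpha=0.05):
    """Number of comparison pairs for a 100(1 - alpha)% CI with margin of error B."""
    z = norm.ppf(1 - alpha / 2)
    return math.ceil(z ** 2 * gamma * (1 - gamma) * (1 + (m - 1) * rho) / (m * B ** 2))

def individuals_for_fmr(n_pairs):
    """Smallest n* with n*(n* - 1) >= n_pairs cross-comparison pairs."""
    return math.ceil((1 + math.sqrt(1 + 4 * n_pairs)) / 2)

print(n_comparison_pairs(gamma=0.01, rho=0.4, m=10, B=0.005))              # 700 (Table 1, row 1)
print(n_comparison_pairs(gamma=0.01, rho=0.4, m=10, B=0.005, alpha=0.01))  # 1209 (Table 1, row 3)
print(individuals_for_fmr(700))   # 27 individuals, since 27 * 26 = 702 >= 700
```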

Sample Selection

Once the number of individuals to be selected is determined, another important step is to specify the target population of individuals to whom statistical inference will be made. Having done so, a sample would ideally be drawn from that group. However, this is often not possible. The next course of action is to specify a sample that is as demographically similar to the target population as possible. The group of individuals that composes the sample is often referred to as the “volunteer crew” or simply the “sample crew”, Mansfield and Wayman [5]. The more similar the sample crew is to the target population, the more likely it is that estimates based upon the sample crew will be applicable to the target population. Often the sample crew is chosen to be a convenience sample, Hahn and Meeker [1]. Methodology for best selecting the sample crew is an open area of research in biometrics.

One useful tool for extrapolating from estimates based upon the “crew” is poststratification. Poststratification is a statistical tool for re-weighting a sample after it has been taken so that the resulting estimates reflect the known population. Suppose that there are H non-overlapping demographic groups of interest, or strata, and that \(n_h\) individuals have been sampled from among the \(N_h\) total individuals in each stratum. Further suppose that estimates of the error rate, \(\hat{{\gamma }}_{h}\), from each of the strata are known. Then a poststratified estimate of the error rate is
$$\hat{{\gamma }}_{ps} =\sum \limits_{h=1}^{H}{{N}_{h}\over \sum\nolimits_{h'=1}^{H}{N}_{h'}}\,\hat{{\gamma }}_{h}.$$
(8)
An estimate of the variability of the predicted error rate is
$$\hat{V }[\hat{{\gamma }}_{ps}] =\sum \limits_{h=1}^{H}{\left({{N}_{h}\over \sum\nolimits_{h'=1}^{H}{N}_{h'}}\right)}^{2}\hat{V }[\hat{{\gamma }}_{h}],$$
(9)
where \(\hat{V }[\hat{{\gamma }}_{h}]\) can be calculated using the equation found above. A (1 − α) × 100% poststratification confidence interval for the process error rate can then be made using
$$\hat{{\gamma }}_{ps} \pm {z}_{\alpha /2}\sqrt{\hat{V }[\hat{{\gamma }}_{ps } ]}.$$
(10)
As above, use of the Gaussian distribution here is justified by the fact that the estimated error rate, \(\hat{{\gamma }}_{ps}\), is a linear combination of random variables.
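A short sketch of (8)–(10) follows, assuming per-stratum error-rate estimates and their variances (computed, for example, with the interval sketch given earlier) together with known population stratum sizes; the weights used are the population shares \(N_h / \sum_{h'} N_{h'}\), and the function name is illustrative.

```python
# Sketch of the poststratified estimate (8), its variance (9), and the
# confidence interval (10), weighting strata by their population shares.
import numpy as np
from scipy.stats import norm

def poststratified_ci(gamma_h, var_h, N_h, alpha=0.05):
    w = np.asarray(N_h, dtype=float) / np.sum(N_h)    # population stratum weights
    gamma_ps = np.sum(w * np.asarray(gamma_h))        # Eq. (8)
    var_ps = np.sum(w ** 2 * np.asarray(var_h))       # Eq. (9)
    half = norm.ppf(1 - alpha / 2) * np.sqrt(var_ps)  # Eq. (10)
    return gamma_ps, (gamma_ps - half, gamma_ps + half)

# e.g., two strata forming 60% and 40% of the target population:
print(poststratified_ci([0.012, 0.008], [2e-6, 3e-6], [6000, 4000]))
```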

Summary

Testing and evaluation of biometric devices is a difficult undertaking. Two crucial elements of this process are determining the number of individuals from whom to collect data and selecting those individuals. The number of individuals to test can be calculated using (7). To obtain that number, some process quantities need to be specified; these specifications can be based on previous studies, pilot studies, or qualified approximations. Selection of the “crew” for a study is a difficult process. Ideally, a sample drawn from the target population is best, but a demographically similar “crew” is often more attainable. Inference from a demographically similar crew can be improved by the use of poststratification.

Related Entries

Influential Factors to Performance

Performance Evaluation, Overview

Performance Measures

Performance Testing Methodology Standardization

Copyright information

© Springer Science+Business Media, LLC 2009