The a priori determination of a proper sample size necessary to achieve some specified power is an important problem frequently encountered in practical studies. To make inferences about differences between two normal population means, the hypothesis-testing procedure and corresponding sample size formula are well known and easy to apply. For important guidance, see the comprehensive treatments in Cohen (1988) and Murphy and Myors (2004). In the statistical literature, comparison of the means of two normal populations with unknown and possibly unequal variances has been the subject of much discussion and is well recognized as the Behrens–Fisher problem (Kim & Cohen, 1998). The existence and importance of violations of the assumption of homogeneity of variance in clinical research settings are also addressed in Grissom (2000). The practical importance and methodological complexity of the problem have occasioned numerous attempts to develop procedures and algorithms for resolving the issue. Notably, several studies have shown that Welch’s (1938) approximate degrees of freedom approach offers a reasonably accurate solution to the Behrens–Fisher problem. Therefore, Welch’s procedure is routinely introduced in elementary statistics courses and textbooks. Moreover, some popular statistical computer packages, such as SAS and SPSS, have implemented the method for quite some time. In practice, power analyses and sample size calculations are often critical for investigators to credibly address specific research hypotheses and confirm differences. Thus, the planning of sample size should be included as an integral part of the study design. Accordingly, it is of practical interest and fundamental importance to be able to perform these tasks in the context of the Behrens–Fisher problem. The essential question is how to determine sample sizes optimally under different allocation and cost considerations that call for independent random samples from two normal populations with possibly unequal variances.

Conventional studies of power and sample size have not addressed matters of allocation restriction and cost efficiency, although researchers have been exploring design strategies that take into account the impact of different constraints on the sampling scheme and project funding while maintaining adequate power. Specifically, the allocation ratio of group sizes was fixed in the calculation of sample size for comparing independent proportions in Fleiss, Tytun, and Ury (1980), while Heilbrun and McGee (1985) considered sample size determination for the comparison of normal means with a known ratio of variances and one sample size specified in advance. In an actual experiment, however, the available resources are generally limited, and it may require different amounts of effort and cost to recruit subjects for the treatment and the control groups. Assuming homogeneous variances, Nam (1973) presented optimal sample sizes to maximize power for the comparison of treatment and control under budget constraints. Conversely, Allison, Allison, Faith, Paultre, and Sunyer (1997) advocated designing statistically powerful studies while minimizing costs. Interested readers are referred to recent articles by Bacchetti, McCulloch, and Segal (2008) and Bacchetti (2010) for alternative viewpoints and related discussions.

Within the framework of the Behrens–Fisher problem, assuming a desired sample size ratio, Schouten (1999) derived an approximate formula for computing a sufficient sample size for a selected power. In addition, in Schouten (1999), a simplified sample size formula was proposed to minimize the total cost when the cost of treating a subject varies between experimental groups. Also, Lee (1992) determined the optimal sample sizes for a designated power so that the total sample size is minimized. It is important to note that the setting in Lee can be viewed as a special case of Schouten. However, unlike the exact approach of Lee, the presentation of Schouten involved several approximations, including the use of a normal distribution, which does not conform to the notion of a t distribution with approximate degrees of freedom proposed in Welch (1938). Alternatively, Singer (2001) modified the simple formula of Schouten by replacing the percentiles of the standard normal distribution with those of a t distribution with approximate degrees of freedom. Unfortunately, the resulting formulation is questionable on account of its absence of theoretical justification. Detailed analytical and empirical examinations are presented later to demonstrate the underlying drawbacks associated with the approximate procedures of Schouten and Singer. Moreover, Luh and Guo (2007), Guo and Luh (2009), and Luh and Guo (2010) extended the approximations of Schouten and Singer to the two-sample trimmed mean test with unequal variances under allocation and cost considerations. Basically, when the trimming proportion is 0, the procedures of Guo and Luh are applicable to the Behrens–Fisher problem. However, their procedures are still approximate in nature and share the same disadvantages as Schouten’s and Singer’s. More important, the algorithms employed by Guo and Luh fail to take into account the underlying metric of integer sample sizes and often lead to suboptimal results. From a methodological standpoint, the results in Schouten, Singer, Luh and Guo (2007), Guo and Luh, and Luh and Guo (2010) should be reexamined with technical clarifications and exact computations. Indeed, our calculations not only show that the prescribed approximate methods do not guarantee correct optimal sample sizes, but also reveal that some of the optimal sample sizes reported in the empirical illustrations of Lee are actually suboptimal. Due to the discrete character of sample size, a detailed inspection of sample size combinations is required to find the optimal allocation that attains the desired power while giving the least total sample size. This extra step, and the resulting improvement in sample size determination, was not considered by Lee. The theoretical and numerical examinations conducted here provide a comprehensive comparison of the various procedures available to date. In short, the accuracy of the existing sample size procedures for the Behrens–Fisher problem can be further improved by adopting an exact and refined approach.

As was described above, there are important and useful considerations or strategies for study design other than the minimization of total sample size or total cost. Since Welch’s (1938) approach to the Behrens–Fisher problem is so entrenched, it is prudent to present a comprehensive exposition of design configurations in terms of diverse allocation schemes and budget constraints. Here, exact methods are presented to give proper sample sizes either when the ratio of group sizes is fixed in advance or when one sample size is fixed. In addition, detailed procedures are provided to determine the optimal sample sizes that maximize the power for a given total cost and that minimize the cost for a specified power. More important, the corresponding computer algorithms are developed to facilitate computation of the exact necessary sample sizes in actual applications.

Due to the prospective nature of advance research planning, it is difficult to assess the adequacy of selected configurations for model parameters in sample size calculations. The general guideline suggests that typical sources such as previously published research and successful pilot studies can offer plausible and reasonable planning values for the vital model characteristics (Thabane et al., 2010). However, the potential deficiency of using a pilot sample variance to compute the sample size needed to achieve the planned power for one- and two-sample t tests has been examined by, among others, Browne (1995) and Kieser and Wassmer (1996). They showed that the sample sizes provided by the traditional formulas are too small, since these formulas neglect the imprecise nature of a variance estimate. Note that all standard sample size procedures share the same fundamental weakness when sample variance estimates are used for the underlying population parameters. However, the issue is more involved, and a detailed discussion of this topic is beyond the scope of the present study. The interested reader is referred to Browne (1995), Kieser and Wassmer (1996), and the references therein for further details.

The Welch test

As part of a continuing effort to improve the quality of research findings, this research contributes to the derivation and evaluation of sample size methodology for Welch’s (1938) approximate t test for the Behrens–Fisher problem. Consider independent random samples from two normal populations with the following formulation:

$$ {X_{{ij}}}\sim N\left( {{\mu_i},\sigma_i^2} \right), $$

where μ 1, μ 2, \( \sigma_1^2 \), and \( \sigma_2^2 \) are unknown parameters, j = 1, . . . , N i , and i = 1 and 2. For detecting the group effect in terms of the hypothesis H0: μ 1 = μ 2 versus H1: μ 1 ≠ μ 2, the well-known Welch’s t statistic has the form

$$ V = \frac{{{{\bar{X}}_1} - {{\bar{X}}_2}}}{{{{\left( {S_1^2/{N_1} + S_2^2/{N_2}} \right)}^{{1/2}}}}}, $$

where \( \bar{X}_1 = \sum\limits_{j=1}^{N_1} X_{1j}/N_1 \), \( \bar{X}_2 = \sum\limits_{j=1}^{N_2} X_{2j}/N_2 \), \( S_1^2 = \sum\limits_{j=1}^{N_1} (X_{1j} - \bar{X}_1)^2/(N_1 - 1) \), and \( S_2^2 = \sum\limits_{j=1}^{N_2} (X_{2j} - \bar{X}_2)^2/(N_2 - 1) \). Under the null hypothesis H0: μ 1 = μ 2, Welch (1938) proposed the approximate distribution for V:

$$ V\dot{ \sim }t\left( {\hat{v}} \right), $$
(1)

where \( t\left( {\hat{v}} \right) \) is the t distribution with degrees of freedom \( \hat{v} \) and \( \hat{v} = \hat{v}\left( {{N_1},{N_2},S_1^2,S_2^2} \right) \), with

$$ {1}/\hat{v} = \frac{1}{{{N_1} - 1}}{\left\{ {\frac{{S_1^2/{N_1}}}{{S_1^2/{N_1} + S_2^2/{N_2}}}} \right\}^2} + \frac{1}{{{N_2} - 1}}{\left\{ {\frac{{S_2^2/{N_2}}}{{S_1^2/{N_1} + S_2^2/{N_2}}}} \right\}^2}. $$

Hence, H0 is rejected at the significance level α if \( \left| V \right| > {t_{\hat{v},\alpha /2}} \), where \( {t_{\hat{v},\alpha /2}} \) is the upper 100(α/2)th percentile of the t distribution \( t(\hat{v}) \). The same notion was independently suggested by Smith (1936) and Satterthwaite (1946), and the test is sometimes referred to as the Smith–Welch–Satterthwaite test.
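For readers who wish to verify the test numerically, the following short R fragment is a minimal sketch of the statistic V, the approximate degrees of freedom, and the rejection rule; the function name welch_test and its arguments are illustrative and are not taken from the supplementary programs.

```r
# Minimal sketch of Welch's approximate t test; x1 and x2 are the two samples.
welch_test <- function(x1, x2, alpha = 0.05) {
  n1 <- length(x1); n2 <- length(x2)
  s1 <- var(x1);    s2 <- var(x2)
  se2 <- s1 / n1 + s2 / n2                          # squared standard error
  v   <- (mean(x1) - mean(x2)) / sqrt(se2)          # Welch's statistic V
  df  <- se2^2 / ((s1 / n1)^2 / (n1 - 1) + (s2 / n2)^2 / (n2 - 1))  # approximate df
  crit <- qt(1 - alpha / 2, df)                     # critical value t_{df, alpha/2}
  list(V = v, df = df, reject = abs(v) > crit, p.value = 2 * pt(-abs(v), df))
}
# The built-in call t.test(x1, x2, var.equal = FALSE) performs the same test.
```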

It is important to emphasize that the degrees of freedom \( \hat{v} \) is bounded by the smaller of N 1 − 1 and N 2 − 1 at one end and by N 1 + N 2 − 2 at the other; that is, \( {\text{Min}}\left( {N_1} - 1,{N_2} - 1 \right) \leqslant \hat{v} \leqslant {N_1} + {N_2} - 2 \). Because the critical value \( t_{df,\alpha /2} \) decreases as df increases, the approximate critical value \( {t_{\hat{v},\alpha /2}} \) is slightly larger than the critical value \( {t_{{N_1} + {N_2} - 2,\alpha /2}} \) of the two-sample t test under the homogeneity of variance assumption. Although the differences between the two critical values are small with moderate to large sample sizes, they reflect the conceptual distinction between Welch’s t test and the regular two-sample t test. Note that a standard normal distribution can be viewed as a t distribution with an infinite number of degrees of freedom. However, the close resemblance between a standard normal distribution and a t distribution never causes introductory courses or textbooks to omit coverage of Student’s t distribution. Accordingly, the theoretical distinction between the critical value \( {t_{{N_1} + {N_2} - 2,\alpha /2}} \) and the standard normal critical value z α/2 is highly analogous to that between \( {t_{\hat{v},\alpha /2}} \) and \( {t_{{N_1} + {N_2} - 2,\alpha /2}} \). Ultimately, the t approximation with the approximate degrees of freedom given in Eq. 1 serves as the prime solution to the Behrens–Fisher problem.

Although the underlying normality assumption in the above-mentioned two-sample location problem provides a convenient and useful setup, the exact distribution of Welch’s test statistic V is comparatively complicated and may be expressed in different forms (see Wang, 1971, Lee & Gurland, 1975, and Nel, van der Merwe, & Moser, 1990, for technical derivation and related details). For ease of presentation, we need to develop some notation. It follows from the fundamental assumption that \( Z = \left( {{{\bar{X}}_1} - {{\bar{X}}_2}} \right)/\sigma \sim N\left( {\delta, 1} \right) \), \( \delta = {\mu_d}/\sigma \), \( {\mu_d} = \left( {{\mu_1} - {\mu_2}} \right) \), \( {\sigma^2} = \sigma_1^2/{N_1} + \sigma_2^2/{N_2} \), \( W = \left( {{N_1} - 1} \right)S_1^2/\sigma_1^2 + \left( {{N_2} - 1} \right)S_2^2/\sigma_2^2 \sim {\chi^2}\left( {{N_1} + {N_2} - 2} \right) \) and \( B = \left\{ {\left( {{N_1} - 1} \right)S_1^2/\sigma_1^2} \right\}/W \sim {\text{Beta}}\left\{ {\left( {{N_1} - 1} \right)/2,\left( {{N_2} - 1} \right)/2} \right\} \). Thus, we consider the following alternative expression of V for its ease of numerical investigation:

$$ V = \frac{T}{{{H^{{1/2}}}}}, $$
(2)

where \( T = Z/{\left\{ W/\left( {N_1} + {N_2} - 2 \right) \right\}^{1/2}} \sim t\left( {N_1} + {N_2} - 2,\delta \right) \), \( t\left( {N_1} + {N_2} - 2,\delta \right) \) is the noncentral t distribution with degrees of freedom \( {N_1} + {N_2} - 2 \) and noncentrality parameter δ, and \( H = [(\sigma_1^2/{N_1})\left\{ B/p \right\} + (\sigma_2^2/{N_2})\{ (1 - B)/(1 - p)\} ]/{\sigma^2} \) with \( p = ({N_1} - 1)/({N_1} + {N_2} - 2) \). Note that the random variables Z, W, and B are mutually independent. Hence, T and B are independent. Also, it is important to note that \( 1/\hat{v} = B_1^2/({N_1} - 1) + B_2^2/({N_2} - 1) \), where \( {B_2} = 1 - {B_1} \) and \( {B_1} = [(\sigma_1^2/{N_1})\left\{ B/p \right\}]/[(\sigma_1^2/{N_1})\left\{ B/p \right\} + (\sigma_2^2/{N_2})\{ (1 - B)/(1 - p)\} ] \). Hence, both H and \( \hat{v} \) are functions of the random variable B.

With the prescribed distributional properties in Eq. 2, the associated power function of V is denoted by

$$ \pi \left( {{\mu_d},\sigma_1^2,\sigma_2^2,{N_1},{N_2}} \right) = P\left\{ {\left| V \right| > {t_{{\hat{v}}}}_{{,\alpha /2}}} \right\} = P\left\{ {\left| T \right| > {t_{{\hat{v},\alpha /2}}} \cdot {H^{{1/2}}}} \right\} $$
(3)

The numerical computation of exact power requires the evaluation of the cumulative distribution function of a noncentral t variable and a one-dimensional integration with respect to a beta probability density function. Since all related functions are readily embedded in major statistical packages, the exact computations can be conducted with current computing capabilities. To determine sample size, the power function can be employed to calculate the sample sizes (N 1, N 2) needed to attain the specified power 1 − β for the chosen significance level α and parameter values \( ({\mu_1},{\mu_2},\sigma_1^2,\sigma_2^2) \). Clearly, the power function is rather complex, and finding the solution usually involves an iterative process, because both the random variable V and the critical value \( {t_{\hat{v},\alpha /2}} \) are functions of the sample sizes (N 1, N 2). In order to enhance the applicability of sample size methodology and the fundamental usefulness of Welch’s (1938) procedure, in subsequent sections this study considers design configurations allowing for different allocation constraints and cost considerations. The R (R Development Core Team, 2010) and SAS/IML (SAS Institute, 2008a) programs employed to perform the corresponding sample size calculations are available in the supplementary files.
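To make the computation concrete, the following R fragment sketches one way to evaluate the exact power in Eq. 3 by integrating the conditional rejection probability against the beta density of B. The function welch_power() and its argument names are our own illustration, not the authors’ supplementary code, and it assumes the parameterization defined above.

```r
# A sketch of the exact power in Eq. 3: the conditional power given B = b is
# averaged over the Beta((n1 - 1)/2, (n2 - 1)/2) distribution of B.
welch_power <- function(mu_d, var1, var2, n1, n2, alpha = 0.05) {
  df    <- n1 + n2 - 2
  p     <- (n1 - 1) / df
  sig2  <- var1 / n1 + var2 / n2
  delta <- mu_d / sqrt(sig2)                      # noncentrality parameter
  cond_power <- function(b) {                     # power conditional on B = b
    a1 <- (var1 / n1) * (b / p)
    a2 <- (var2 / n2) * ((1 - b) / (1 - p))
    h  <- (a1 + a2) / sig2                        # the scaling factor H
    b1 <- a1 / (a1 + a2)                          # B1, with B2 = 1 - B1
    nu <- 1 / (b1^2 / (n1 - 1) + (1 - b1)^2 / (n2 - 1))   # approximate df
    q  <- qt(1 - alpha / 2, nu) * sqrt(h)         # scaled critical value
    1 - pt(q, df, ncp = delta) + pt(-q, df, ncp = delta)
  }
  integrate(function(b) sapply(b, cond_power) * dbeta(b, (n1 - 1) / 2, (n2 - 1) / 2),
            lower = 0, upper = 1)$value
}
```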

Allocation constraints

Since there may be several possible choices of sample sizes N 1 and N 2 that satisfy the chosen power level in the process of sample size calculation, it is prudent to consider an appropriate design that permits a unique and optimal result. The following two allocation constraints are considered because of their potential usefulness. First, the ratio r = N 2/N 1 between the two group sizes may be fixed in advance, so the task is to decide the minimum sample size N 1 (N 2 = rN 1) required to achieve the specified power level. Second, one of the two sample sizes, say N 2, may be pre-assigned, and so the smallest size N 1 required to satisfy the designated power should be found.

Sample size ratio is fixed

Assume that the sample size ratio r = N 2/N 1 is fixed in advance. To facilitate computation, without loss of generality, the ratio can be taken as r ≥ 1. Then the power function \( \pi ({\mu_d},\sigma_1^2,\sigma_2^2,{N_{{1}}},{N_{{2}}}) \) of V becomes a strictly monotone function of N 1 when all other factors are treated as constants. A simple incremental search can be conducted to find the minimum sample size N 1 needed to attain the specified power 1-β for the chosen significance level α and parameter values \( ({\mu_{{1}}},{\mu_{{2}}},\sigma_1^2,\sigma_2^2) \). To simplify the computation, the large-sample normal approximation \( V\dot{ \sim }N(\delta, { 1}) \) can be used to provide initial values to start the iteration. Specifically, the starting sample size N 1Z computed by the normal approximation would be the smallest integer that satisfies the inequality

$$ {N_{1Z}} \geqslant \left( \sigma_1^2 + \sigma_2^2/r \right){\left( {z_{\alpha /2}} + {z_\beta } \right)^2}/\mu_d^2, $$
(4)

where \( z_{\alpha /2} \) and \( z_\beta \) are the upper 100(α/2)th and 100·βth percentiles of the standard normal distribution, respectively.
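As a concrete illustration of the search, the following R sketch starts from the normal-approximation value in Eq. 4 and adjusts N 1 until the exact power from the welch_power() sketch above first meets the target; the function name and arguments are again illustrative.

```r
# Smallest N1 (with N2 = ceiling(r * N1)) attaining the target power for a fixed ratio r.
size_fixed_ratio <- function(mu_d, var1, var2, r, alpha = 0.05, power = 0.90) {
  zq <- qnorm(1 - alpha / 2) + qnorm(power)
  n1 <- max(2, ceiling((var1 + var2 / r) * zq^2 / mu_d^2))   # Eq. 4 starting value
  # step down while the target is still met, then up until it is met
  while (n1 > 2 &&
         welch_power(mu_d, var1, var2, n1 - 1, ceiling(r * (n1 - 1)), alpha) >= power)
    n1 <- n1 - 1
  while (welch_power(mu_d, var1, var2, n1, ceiling(r * n1), alpha) < power)
    n1 <- n1 + 1
  c(N1 = n1, N2 = ceiling(r * n1))
}
```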

For illustration, when μ d = 1, α = .05, and 1 − β = .90, the sample sizes N 1 and N 2 = r·N 1 are presented in Table 1 for selected values of r = 1, 2, and 3, σ 1 = 1/3, 1/2, 1, 2, and 3, and σ 2 = 1. The actual power is also listed, and the values are marginally larger than the nominal level .90. Note that the SAS procedure PROC POWER (SAS Institute, 2008b) provides the same feature for finding the optimal sample sizes N 1 and N 2 with a given sample size ratio. However, it does not accommodate the extended settings in which one of the sample sizes is fixed, nor the more involved cost concerns that we consider next.

Table 1 Computed sample sizes (N 1, N 2) and actual power when sample size ratio r = N 2/N 1 is fixed with μ d = 1, α = .05, and 1 − β = .90

One sample size is fixed

For ease of exposition, the sample size N 2 of the second group is held constant. Just as in the previous case, the minimum sample size N 1 needed to ensure the specified power 1 − β can be found by a simple iterative search for the chosen significance level α and parameter values \( ({\mu_1},{\mu_2},\sigma_1^2,\sigma_2^2) \). In this case, the starting sample size N 1Z , based on the normal approximation, is the smallest integer that satisfies the inequality

$$ {N_{1Z}} \geqslant \sigma_1^2/\left\{ \mu_d^2/{\left( {z_{\alpha /2}} + {z_\beta } \right)^2} - \sigma_2^2/{N_2} \right\}. $$
(5)

However, it should be noted that this may be problematic when a small value of N 2 is chosen. If \( {N_2} < \sigma_2^2/\left\{ \mu_d^2/{\left( {z_{\alpha /2}} + {z_\beta } \right)^2} \right\} \), then the initial value N 1Z is negative, which is obviously unrealistic. Moreover, for \( {N_2}\dot{ = }\sigma_2^2/\left\{ \mu_d^2/{\left( {z_{\alpha /2}} + {z_\beta } \right)^2} \right\} \), the resulting N 1Z and N 1 values are unbounded, and the results do not have practical value. Accordingly, Table 2 presents the computed sample size N 1 and the actual power levels for the chosen values of N 2 under the same settings with μ d = 1, α = .05, 1 − β = .90, and the variance combinations in Table 1.
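A corresponding R sketch for this case, again relying on the hypothetical welch_power() helper, uses Eq. 5 as the starting value and guards against the unattainable configurations just described.

```r
# Smallest N1 attaining the target power when N2 is fixed in advance.
size_fixed_n2 <- function(mu_d, var1, var2, n2, alpha = 0.05, power = 0.90) {
  zq    <- qnorm(1 - alpha / 2) + qnorm(power)
  denom <- mu_d^2 / zq^2 - var2 / n2
  if (denom <= 0) stop("N2 is too small for the target power (see the discussion of Eq. 5)")
  n1 <- max(2, ceiling(var1 / denom))                          # Eq. 5 starting value
  while (n1 > 2 && welch_power(mu_d, var1, var2, n1 - 1, n2, alpha) >= power) n1 <- n1 - 1
  # ad hoc cap: a small fixed N2 can make the target power unreachable even exactly
  while (welch_power(mu_d, var1, var2, n1, n2, alpha) < power && n1 < 1e4) n1 <- n1 + 1
  c(N1 = n1, N2 = n2)
}
```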

Table 2 Computed sample sizes (N 1, N 2) and actual power when sample size N 2 is fixed with μ d = 1, α = .05 and 1 –β = .90

Cost considerations

With limited research funding, it is desirable to consider cost and effectiveness issues during the planning stage. In addition, the costs of obtaining subjects for the treatment and control groups are not necessarily the same. Suppose that c 1 and c 2 are the costs per subject in the first and second groups, respectively; then, the total cost of the experiment is C = c 1 N 1 + c 2 N 2. The following two questions arise with considerable frequency in sample size determination. First, given a fixed amount of money, what is the maximum power that the design can achieve? Second, assuming a preferred degree of power, what is the design that costs the least? In both cases, equal sample sizes for the two groups do not necessarily yield the optimal solution (Allison et al., 1997). Consequently, optimally unbalanced designs can be more efficient, and a detailed and systematic approach to sample size allocation is required.

With the simplified asymptotic approximation of Welch’s test, \( V\dot{ \sim }N(\delta, 1) \), the optimal allocation is obtained for the two prescribed scenarios when the ratio of the sample sizes satisfies the equality

$$ \frac{{{N_2}}}{{{N_1}}} = \theta, $$
(6)

where \( \theta = {\sigma_{{2}}}c_1^{{1/2}}/\left( {{\sigma_1}c_2^{{1/2}}} \right) \). However, the exact distribution of V given in Eq. 2 involves a beta mixture of noncentral t distributions. Thus, the associated properties can be notably different from a normal distribution for finite sample sizes. It is understandable that the particular identity of Eq. 6 will give a suboptimal result when the sample sizes are small. Such a phenomenon is demonstrated in the following illustration.

Total cost is fixed and actual power needs to be maximized

To develop a systematic search for the optimal solution, the aforementioned normal approximation is utilized as the benchmark in the exploration. It can be shown, under a fixed value of total cost C, that the maximum power is obtained with the sample size combination

$$ \,\,\begin{array}{*{20}{c}} {{N_{{{1}Z}}} = \frac{{C\left( {{\sigma_1}c_2^{{1/2}}} \right)}}{{{c_1}\left( {{\sigma_1}c_2^{{1/2}}} \right) + {c_2}\left( {{\sigma_2}c_1^{{1/2}}} \right)}}{\text{and}}} \\ {{N_{{{2}Z}}} = \frac{{C\left( {{\sigma_2}c_1^{{1/2}}} \right)}}{{{c_1}\left( {{\sigma_1}c_2^{{1/2}}} \right) + {c_2}\left( {{\sigma_2}c_1^{{1/2}}} \right)}}} \\ \end{array} . $$
(7)

It is easy to see that \( {c_1}{N_{1Z}} + {c_2}{N_{2Z}} = C \) and \( {N_{2Z}}/{N_{1Z}} = \theta \), as in Eq. 6. But in practice, the sample sizes need to be integer values, so the use of discrete numbers introduces some inexactness into the cost analysis. To find the proper result, a detailed power calculation and comparison are performed for the sample size combinations with N 1 from N 1min to N 1max and \( {N_2} = {\text{Floor}}\left[ (C - {c_1}{N_1})/{c_2} \right] \), where \( {N_{\text{1min}}} = {\text{Floor}}\left( {N_{1Z}} \right) - 1 \), \( {N_{\text{1max}}} = {\text{Floor}}[\{ C - {c_2}({\text{Floor}}\left( {N_{2Z}} \right) - 1)\} /{c_1}] \), and the function Floor(a) returns the largest integer that is less than or equal to a. Thus, the optimal sample size allocation is the one giving the largest power. Numerical results are given in Table 3 for (c 1, c 2) = (1, 1), (1, 2), and (1, 3) and fixed total cost C = 25, 30, 50, 100, and 180 in accordance with the five standard deviation settings of σ 1 and σ 2 reported in the previous two tables. Examination of the results in Table 3 reveals that the actual power for a given total cost decreases drastically as the unit cost c 2 increases from 1 to 3. Regarding the optimal allocation, the general formula for the sample size ratio presented in Eq. 6 does not hold in several cases. For example, the ratio \( {N_2}/{N_1} = 11/17 = 0.6471 \) for (σ 1, σ 2) = (1, 1) and (c 1, c 2) = (1, 3) is slightly greater than the ratio computed with Eq. 6: \( \theta = \left( 1\cdot {1^{1/2}} \right)/\left( 1\cdot {3^{1/2}} \right) = 0.5774 \). It should be noted that Guo and Luh (2009, Eq. 20) give the same approximate sample size formulas as in Eq. 7. However, they did not discuss how to utilize this particular result to find the ideal sample sizes for a fixed cost. Also, the numerical demonstration of Guo and Luh (p. 291) did not provide a systematic search for the optimal solution, and the sample sizes reported in their exposition are not integers. Ultimately, the inexactness issue incurred by integer sample sizes in cost analysis is not addressed by Guo and Luh.
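The search just described can be sketched in R as follows, again building on the hypothetical welch_power() helper; the candidate range follows the Floor-based bounds given above, and a small tolerance guards the integer division against floating-point error.

```r
# Maximize exact power over integer allocations whose total cost does not exceed C.
max_power_fixed_cost <- function(mu_d, sd1, sd2, c1, c2, C, alpha = 0.05) {
  denom <- c1 * sd1 * sqrt(c2) + c2 * sd2 * sqrt(c1)
  n1z   <- C * sd1 * sqrt(c2) / denom                         # Eq. 7
  n2z   <- C * sd2 * sqrt(c1) / denom
  n1_lo <- max(2, floor(n1z) - 1)
  n1_hi <- floor((C - c2 * (floor(n2z) - 1)) / c1)
  best  <- NULL
  for (n1 in n1_lo:n1_hi) {
    n2 <- floor((C - c1 * n1) / c2 + 1e-9)                    # largest affordable N2
    if (n2 < 2) next
    pw <- welch_power(mu_d, sd1^2, sd2^2, n1, n2, alpha)
    if (is.null(best) || pw > best["power"]) best <- c(N1 = n1, N2 = n2, power = pw)
  }
  best
}
```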

Table 3 Computed sample sizes (N 1, N 2) and actual power when the total cost is fixed with μ d = 1 and α = .05

Target power is fixed and total cost needs to be minimized

In contrast to the previous situation, in which the total cost was fixed, both power performance and cost appraisal can be accommodated by finding the allocation that minimizes cost when the target power is pre-chosen. In this case, the large-sample theory shows that, in order to ensure the nominal power while minimizing the total cost \( C = {c_1}{N_{1Z}} + {c_2}{N_{2Z}} \), the best sample size combination is

$$ {N_{{{1}Z}}} = \frac{{\theta \sigma_1^2 + \sigma_2^2}}{{\theta \mu_d^2/{{\left( {{Z_{{\alpha /2}}} + {Z_{\beta }}} \right)}^2}}}{\text{and}}\,{N_{{{2}Z}}} = \frac{{\theta \sigma_1^2 + \sigma_2^2}}{{\mu_d^2/{{\left( {{Z_{{\alpha /2}}} + {Z_{\beta }}} \right)}^2}}}, $$
(8)

where θ is the optimal ratio in Eq. 6. It can be readily seen that \( {N_{2Z}}/{N_{1Z}} = \theta \) and \( \sigma_1^2/{N_{1Z}} + \sigma_2^2/{N_{2Z}} = \mu_d^2/{\left( {z_{\alpha /2}} + {z_\beta } \right)^2} \). Due to the discrete character of sample size, the optimal allocation is found through a screening of sample size combinations that attain the desired power while giving the least cost. The exact power computation and cost evaluation are conducted for sample size combinations with N 1 from N 1min to N 1max and a proper value of \( {N_2} \geqslant {\text{Floor}}\left[ \sigma_2^2/\left\{ \mu_d^2/{\left( {z_{\alpha /2}} + {z_\beta } \right)^2} - \sigma_1^2/{N_1} \right\} \right] \) satisfying the required power, where \( {N_{\text{1min}}} = {\text{Floor}}\left( {N_{1Z}} \right) \), \( {N_{\text{1max}}} = {\text{Ceil}}\left[ \sigma_1^2/\left\{ \mu_d^2/{\left( {z_{\alpha /2}} + {z_\beta } \right)^2} - \sigma_2^2/\left( {\text{Floor}}\left( {N_{2Z}} \right) - 1 \right) \right\} \right] \), and the function Ceil(a) returns the smallest integer that is greater than or equal to a. Thus, the optimal sample size allocation is the one giving the smallest cost while maintaining the specified power level. In cases where more than one combination yields the same least cost, the one producing the larger power is reported. Table 4 provides the corresponding optimal sample size allocation, cost, and actual power for the configurations (c 1, c 2) = (1, 1), (1, 2), and (1, 3) and the five standard deviation settings of σ 1 and σ 2 in the preceding tables. It is clear that the total cost for a required power and fixed standard deviations increases substantially as the unit cost c 2 changes from 1 to 3. Again, the sample size ratios are close to, but different from, the approximate ratio θ. The largest discrepancy occurs in the case \( {N_2}/{N_1} = 16/6 = 2.6667 \) for (σ 1, σ 2) = (1/3, 1) and (c 1, c 2) = (1, 1), whereas the counterpart ratio is \( \theta = (1\cdot {1^{1/2}})/(1/3\cdot {1^{1/2}}) = 3 \).
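The corresponding minimum-cost screening can be sketched in R as follows; it reuses the hypothetical welch_power() helper, takes Eq. 8 for the bounds of the candidate range, and imposes an ad hoc cap on N 2 purely to keep the sketch finite.

```r
# Minimize total cost c1*N1 + c2*N2 subject to the exact power reaching the target.
min_cost_fixed_power <- function(mu_d, sd1, sd2, c1, c2, alpha = 0.05, power = 0.90) {
  zq    <- qnorm(1 - alpha / 2) + qnorm(power)
  theta <- sd2 * sqrt(c1) / (sd1 * sqrt(c2))                  # Eq. 6
  n1z   <- (theta * sd1^2 + sd2^2) / (theta * mu_d^2 / zq^2)  # Eq. 8
  n2z   <- (theta * sd1^2 + sd2^2) / (mu_d^2 / zq^2)
  n1_lo <- max(2, floor(n1z))
  n1_hi <- ceiling(sd1^2 / (mu_d^2 / zq^2 - sd2^2 / (floor(n2z) - 1)))
  best  <- NULL
  for (n1 in n1_lo:n1_hi) {
    d <- mu_d^2 / zq^2 - sd1^2 / n1
    if (d <= 0) next
    n2 <- max(2, floor(sd2^2 / d))                            # normal-theory start for N2
    pw <- welch_power(mu_d, sd1^2, sd2^2, n1, n2, alpha)
    while (pw < power && n2 < 1e4) {                          # ad hoc cap for the sketch
      n2 <- n2 + 1
      pw <- welch_power(mu_d, sd1^2, sd2^2, n1, n2, alpha)
    }
    if (pw < power) next
    cost <- c1 * n1 + c2 * n2
    if (is.null(best) || cost < best["cost"] ||
        (cost == best["cost"] && pw > best["power"]))
      best <- c(N1 = n1, N2 = n2, cost = cost, power = pw)
  }
  best
}
```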

Table 4 Computed sample sizes (N 1, N 2), cost, and actual power when the total cost needs to be minimized with target power 1 –β = .90, μ d = 1, and α = .05

To demonstrate the advantage and importance of the exact technique, we also examine the theoretical and empirical properties of the approximate methods of Schouten (1999) and Singer (2001). Specifically, Schouten’s (p. 90) formulas are based on the normal approximation and give the same approximate estimates N 1Z and N 2Z as defined in Eq. 8. In view of the approximate t distribution of Welch’s test statistic V defined in Eq. 1, Singer (Eq. 2) suggested a modification of Eq. 4 obtained by replacing the percentiles of the standard normal distribution with those of a t distribution with degrees of freedom \( \hat{v} \). This requires an iterative process to find the smallest integer that satisfies the inequality

$$ {N_{1S}} \geqslant \left( \sigma_1^2 + \sigma_2^2/{r_S} \right){\left( {t_{\hat{v},\alpha /2}} + {t_{\hat{v},\beta }} \right)^2}/\mu_d^2, $$
(9)

where r S = N 2S /N 1S . However, Singer did not provide any analytical justification for this alternative expression. Essentially, the naive formulation of Eq. 9 is questionable for lack of theoretical explanation. It is well known that if Z ~ N(0, 1), then \( X = (Z + \mu ) \sim N(\mu, 1) \), where μ is a constant. This particular result and related properties yield the approximate formulas in Eq. 8. On the other hand, the linear transformation of the normal distribution does not generalize to the case of the t distribution; that is, if t ~ t(df), then \( Y = (t + \mu ) \) does not follow a noncentral t distribution t(df, μ) with noncentrality parameter μ and degrees of freedom df. Actually, a random variable Y is said to have a noncentral t distribution t(df, μ) if and only if \( Y = (Z + \mu )/{\left( W/df \right)^{1/2}} \), where Z ~ N(0, 1), W ~ χ2(df), and Z and W are independent (Rencher, 2000, pp. 102–103). This may explain why direct substitution of standard normal percentiles with those of a t distribution has rarely been described in the sample size literature. Instead, an iterative search based on the exact distribution is required for statistical rigor and exactness. Nevertheless, Guo and Luh (2009) applied Eq. 9 with r S = θ to determine optimal sample sizes when the target power is fixed and the total cost needs to be minimized.
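For completeness, one plausible reading of the iteration implied by Eq. 9 can be sketched in R as follows; it is included only to make the comparisons below easier to follow and represents one of the approximate procedures that the exact approach is argued to supersede. The function name, the use of planning variances in place of sample variances, and the convergence guard are our own.

```r
# Sketch of an iterative solution of Singer's Eq. 9 for a fixed ratio r; var1 and var2
# are planning values standing in for the unknown population variances.
singer_sample_size <- function(mu_d, var1, var2, r, alpha = 0.05, power = 0.80) {
  zq <- qnorm(1 - alpha / 2) + qnorm(power)
  n1 <- ceiling((var1 + var2 / r) * zq^2 / mu_d^2)            # normal-theory start
  for (iter in 1:100) {                                       # simple convergence guard
    n2  <- ceiling(r * n1)
    se2 <- var1 / n1 + var2 / n2
    nu  <- se2^2 / ((var1 / n1)^2 / (n1 - 1) + (var2 / n2)^2 / (n2 - 1))
    new_n1 <- ceiling((var1 + var2 / r) *
                      (qt(1 - alpha / 2, nu) + qt(power, nu))^2 / mu_d^2)
    if (new_n1 == n1) break
    n1 <- new_n1
  }
  c(N1 = n1, N2 = ceiling(r * n1))
}
```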

For the purpose of comparison, we performed an extensive numerical examination of sample size calculations for the model settings in Table 4 of Guo and Luh (2009). To our knowledge, no research to date has compared the performance of the available approximate procedures with the exact method. All the sample sizes, cost, and corresponding actual power of the two approximate methods of Schouten (1999) and Singer (2001) and the exact approach are presented in Table 5. For target power 1 -β = .80, μ d = 1, and α = .05, a total of 24 model settings are examined according to the combined configurations of standard deviation ratio (σ 1:σ 2 = 1: 1 and 1: 2) and unit cost ratio (c 1:c 2 = 1: 2, 1: 1, and 2: 3) for \( \sigma_1^2 = 1.00 \), 2.15, 1.46, and 4.18. The sample sizes computed by Schouten’s method are denoted by N 1Z and N 2Z , whereas the sample sizes N 1S and N 2S listed in Table 5 for the procedure of Singer are exact replicates of those presented for the untrimmed case in Table 4 of Guo and Luh. The corresponding exact sample sizes computed with the suggested approach are expressed as N 1E and N 2E .

Table 5 Computed sample sizes (N 1, N 2), cost, and actual power for different procedures when the total cost needs to be minimized with target power 1 –β = .80, μ d = 1, and α = .05

It can be readily seen from Table 5 that there are discrepancies between the approximate and exact procedures. First, the normal approximation or Schouten’s (1999) method is misleading, because only 4 out of the 24 cases attain the target power level of .80 (cases 4, 6, 12, and 24). Thus, the sample sizes N 1Z and N 2Z are generally inadequate. Of the four occasions that meet the minimum power requirement, the resulting costs of cases 6, 12, and 24 are larger than those of the exact approach, so the reported sample sizes N 1Z and N 2Z are again not optimal. Case 4 is the single instance that agrees with the exact result. On the other hand, all the sample sizes N 1S and N 2S associated with Singer’s (2001) method satisfy the necessary minimum power of .80. While there are seven occurrences (cases 2, 4, 8, 14, 15, 19, and 20) that match the exact results, the other 17 sample size combinations N 1S and N 2S suffer the disadvantage of incurring higher cost than the optimal selections N 1E and N 2E . In view of this empirical evidence, it is clear that the existing approximate procedures of Schouten and Singer are not accurate enough to guarantee optimal sample sizes, and, therefore, the procedures presented in Eqs. 8 and 9 are not recommended.

Furthermore, Lee (1992) examined the same problem without considering differential unit costs per subject in the two groups, which can be viewed as a special case of the presentation here with c 1 = c 2 = 1. Even so, his algorithm for determining the optimal sample sizes is questionable. For example, when σ 1 = σ 2 = 1, the reported sample sizes are N 1 = N 2 = 23 with total cost = total sample size = 46 and actual power .9121. In contrast, our computation gives N 1 = 23 and N 2 = 22, with total cost = total sample size = 45 and attained power .9057. Therefore, to maintain the target power level of .90, a total sample size of only 45 is required, rather than the 46 reported by Lee. Consequently, it is worthwhile to conduct the suggested exact sample size computations.

Numerical example

To demonstrate the features across different allocation constraints and cost considerations in sample size planning, the comparison of ability tests administered online and in the laboratory reported by Ihme et al. (2009) is used as an example. The test scores collected online and offline are assumed to have normal distributions with different variances, because the demographic structure of online samples can differ from that of offline samples acquired in conventional laboratory settings. To illustrate sample size determination for design planning, the results of Ihme et al. are modified to have the underlying population parameter values μ Lab = 11, μ Online = 10, σ Lab = 2.3, and σ Online = 2.7. It is clear that online testing has the advantages of ease of obtaining a large sample and low cost. Thus, it may be desirable to set the sample ratio as N Online/N Lab = 4/1, which implies that the sample sizes required to attain power .90 at the significance level .05 are N Lab = 76 and N Online = 304. In the case in which the sample size N Online is fixed at 400, the offline group needs sample size N Lab = 71 to meet the same power and significance requirements. However, it is important to take budget issues into account. Assume that the available total cost is set as C = 100 and the respective unit costs per subject are c Lab = 1 and c Online = 0.2. The optimal sample size solution is N Lab = 65 and N Online = 175, which has an actual power of .8079. On the other hand, to attain the pre-assigned power of .90, the design must have the sample size allocation N Lab = 86 and N Online = 224, which amounts to a budget of C = 130.8. Such information may be useful for investigators to justify the design strategy and financial support. Although they did not address sample size calculation, the reader is referred to Ihme et al. for further details about online achievement tests.
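Using the illustrative R sketches introduced earlier (not the authors’ supplementary programs), the four design questions of this example could be posed roughly as follows, with group 1 taken as the laboratory sample and group 2 as the online sample; small numerical differences from the figures quoted above are possible.

```r
# Fixed ratio N_Online/N_Lab = 4, target power .90 at alpha = .05
size_fixed_ratio(mu_d = 1, var1 = 2.3^2, var2 = 2.7^2, r = 4)
# Online sample size fixed at 400
size_fixed_n2(mu_d = 1, var1 = 2.3^2, var2 = 2.7^2, n2 = 400)
# Budget fixed at C = 100 with unit costs c_Lab = 1 and c_Online = 0.2
max_power_fixed_cost(mu_d = 1, sd1 = 2.3, sd2 = 2.7, c1 = 1, c2 = 0.2, C = 100)
# Minimum-cost design attaining power .90
min_cost_fixed_power(mu_d = 1, sd1 = 2.3, sd2 = 2.7, c1 = 1, c2 = 0.2, power = 0.90)
```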

Conclusion

The problem of testing the equality of the means of two independent and normally distributed populations with unknown and unequal variances has been widely considered in the literature. The distinctive usefulness of Welch’s (1938) test in applications further occasions methodological and practical concerns about the corresponding procedures for sample size determination. Computationally, the power of modern computers and the general availability of statistical software make exact analysis feasible. In view of the importance of sample size calculations in actual practice and the limited features of available computer packages, the corresponding programs are developed to facilitate the use of the suggested approaches. Intensive numerical integration and incremental search are incorporated in the presented computer algorithms for finding the optimal solutions under different design requirements. Furthermore, various sample size tables are provided to help researchers better understand the inherent relationships between the planned sample sizes and the model configurations. The proposed sample size procedures enhance and expand the current methods and should be useful for the planning of research in two-group situations where both the variances and the costs per subject differ across groups.