Introduction

Research on ovulatory effects in humans spans a broad range of behaviors. For example, there is evidence that women's fertility status predicts differences in their self-ornamentation (e.g., Beall & Tracy, 2013; Haselton, Mortezaie, Pillsworth, Bleske-Rechek, & Frederick, 2007), perceived attractiveness and desirability (e.g., Roberts et al., 2004; Schwarz & Hassebrauck, 2008), earnings from lap dances (Miller, Tybur, & Jordan, 2007), frequency of sexual intercourse (e.g., Bullivant et al., 2004; Wilcox et al., 2004), and shifts in attraction to traits indicative of men's masculinity (e.g., Penton-Voak & Perrett, 2000). Similarly, men have been found to engage in more mate-guarding behavior (e.g., Gangestad, Thornhill, & Garver, 2002; Pillsworth & Haselton, 2006) and to show differential testosterone production (e.g., Miller & Maner, 2010) in response to women's fertility status. However, despite this breadth of effects presumably related to fertility, criticisms of the field, its research, and its theories abound.

For example, ongoing criticisms of the ovulatory shift hypothesis (Harris, 2011, 2013; Harris, Chabot, & Mickes, 2013; Harris, Pashler, & Mickes, 2014) contend that inconsistent methodologies, failures to replicate findings (e.g., Peters, Simmons, & Rhodes, 2009), and recent meta-analytic work concluding that ovulatory effects are not robust (i.e., Wood & Joshi, 2011; Wood, Kressel, Joshi, & Louie, 2012a, b, 2014) indicate that research supporting the ovulatory shift hypothesis is the result of spurious findings or inflated researcher degrees of freedom (Simmons, Nelson, & Simonsohn, 2011). These accusations are extreme, and probably false given a rebuttal meta-analysis by Gildersleeve, Haselton, and Fales (2014a) and a p-curve analysis by Gildersleeve, Haselton, and Fales (2014b). However, these criticisms were not completely unfounded. That is, despite recommendations of ideal methodological practices when addressing failures to replicate (e.g., DeBruine et al., 2010), the literature is inconsistent in its use of ovulation and fertility estimation methods, between- versus within-subjects designs, and definitions of fertility windows (Harris, 2013; Harris et al., 2013, 2014; Gildersleeve et al., 2014b). There can be very good reasons for methodological variability in this area of research (e.g., Gildersleeve et al., 2013), but such variability, when it departs from recommended practices, can give the appearance of post-hoc methodological modification.

Consider Gildersleeve et al.'s (2014a) sample of studies specific to the ovulatory shift hypothesis; the Gildersleeve et al. and Wood et al. (2014) samples are very similar, so we give preference to Gildersleeve et al., as they are expert contributors and proponents of this body of research. Despite notions of methodological best practices, between-subjects group designs were used in 43.7 % of studies, and between-subjects continuous measures were used in 18.3 %. Thus, 62 % of all studies reviewed employed a between-subjects design. This means that, despite the supposed strength of repeated-measures methods, within-subjects designs were employed in only 38 % of the studies reviewed.

Furthermore, with respect to fertility estimation, both forward-counting and fertility-likelihood overlay methods (e.g., Wilcox, Dunson, Weinberg, Trussell, & Baird, 2001) were employed heavily in the reviewed literature. Specifically, continuous overlays were used in 18.3 % of all studies, and accounted for 23.9 % of studies that used forward counting and 11.1 % of studies that used some form of backward counting. Of those studies that employed backward counting (25.4 % of all studies), only 33.3 % confirmed menstrual onset to ensure that backward estimates were derived from the correct start date; i.e., only 8.5 % of all reviewed studies confirmed menstrual onset. Finally, only six studies (8.5 % of all studies) employed some sort of hormonal assay, five of which used luteinizing hormone (LH) testing (7.1 % of all studies).

It may be unfair to evaluate the methodology of the entire ovulatory-shift literature as a single period, given that recommended practices emerge as the field develops. For example, treating DeBruine et al.'s (2010) commentary on the methodological limitations of Harris' (2011) replication attempt as a marker of when methodological ideals became prominent in ovulatory-shift research reveals some shifts in methodological focus. Specifically, before 2010, forward-counting methods were used in 69.8 % of studies and backward-counting methods were used in 22.6 % of studies. From 2010 onward, forward-counting utilization declined to 50 % of studies, and backward-counting methods rose to 33.3 % of studies. Similarly, between-subjects designs were used more often than within-subjects designs (64.2 % and 35.8 %, respectively) before 2010, but from 2010 onward the use of between-subjects designs declined and within-subjects designs increased (55.6 % and 44.4 %, respectively). Finally, comparing pre-2010 studies with studies from 2010 onward, there were small increases in the relative frequency of studies confirming menstruation onset when using backward counting (7.5 % and 11.1 %, respectively) and in the use of hormonal estimation of ovulation (7.5 % and 11.1 %, respectively).

While these shifts in methodological practices are encouraging, it is clear that there is still a heavy reliance on counting methods generally, despite evidence and arguments that other approaches (e.g., LH testing) may be more effective (e.g., Bullivant et al., 2004; DeBruine et al., 2010), and on both forward and pseudo backward counting specifically, despite evidence that these procedures are likely more error prone than backward counting with confirmation of next menstrual cycle onset (e.g., Fehring, Schneider, & Raviele, 2006). Utilization of these methods is problematic because increased measurement error due to misidentification of ovulation and fertile periods may diminish power to detect phase-related behavioral shifts. Because both the follicular and luteal phases are variable (e.g., Fehring et al., 2006), both between and within women (Creinin, Keverline, & Meyn, 2004; Fehring et al., 2006), static assumptions of cycle length (such as those underlying counting methods) reduce ovulation estimation accuracy by 37–57 % (Howards et al., 2008). This problem is further compounded when estimation relies on self-reported cycle length (i.e., pseudo backward counting), since women's self-reports of cycle length have been shown to be inaccurate (e.g., Small, Manatunga, & Marcus, 2007).

The present study

While continuing the academic debate over which meta-analysis is correct (Gildersleeve et al., 2014b; Wood & Carden, 2014) makes for “good theater” (Ferguson, 2014), it is plausible that methodological limitations can account for many of the inconsistencies in these findings (e.g., Harris, 2011, 2013; Wood et al., 2014). Comments regarding methodological limitations and recommendations (e.g., DeBruine et al., 2010; Gildersleeve et al., 2013) reflect the following: (a) within-subjects designs are better than between-subjects designs; (b) forward counting is less precise than backward counting, and both are less accurate than hormonal assessments; (c) optimal methods are more expensive with respect to time and money; and (d) with a sufficiently large sample, less optimal methodology is expected to overcome method-based reductions in power. However, it is not immediately clear how these reasonable expectations actually function in research. Specifically, it is not well understood how much of a decrement in power is incurred when using different methods (e.g., between-subjects vs. within-subjects; counting vs. LH testing), or when these differences are negligible rather than pronounced.

To better understand the relation of power and methodology as it pertains to ovulatory effects we evaluate the following conditions for group-based mean comparisons using simulated data: (a) ovulation estimation method (LH testing vs. forward counting vs. unconfirmed backward counting vs. confirmed backward counting); (b) between- vs. within-subject designs; (c) fertile window length (day of peak fertility vs. sample from six-day window vs. sample from nine-day window); and (d) sample size (N = 10, 20, …, 200). Additionally, we evaluate the ability of fertility overlays (i.e., Wilcox et al., 2001) to detect behavioral variability correlated with predicted fertility fluctuations as a function of sample size. Finally, we evaluate differences in fertile days selected as a function of ovulation estimation method using counting methods in both simulated and empirical data sets.

Method

Simulation study

Data simulation procedure

In the present study we simulated a population of 20,000 ovulatory cycles. Each simulated cycle represented a single cycle of an individual woman. All cycles were made up of daily scores. Cycles were generated using estimates of the mean, variance, and covariance of cycle-phase lengths (i.e., menses, follicular, and luteal) previously reported (Fehring et al., 2006; R. J. Fehring, personal communication, 7 April 2014) (see Table 1). The Fehring et al. (2006) sample statistics were used to inform the present simulation for several reasons: (a) its frequency of citation in ovulatory research generally; (b) its frequency of citation in ovulatory-shift research specifically; (c) its citation as evidence of average cycle phase length, variability between and within women, and relative stability of luteal phase lengths compared to the follicular phase (e.g., Garver-Apgar et al., 2008; Gildersleeve et al., 2014a; Larson et al., 2013; Lukaszewski & Roney, 2009; Miller & Maner, 2010; Oinonen & Mazmanian, 2007; Prokosch, Coss, Scheib, & Blozis, 2009; Roney et al., 2011; Rosen & López, 2009; Schwarz & Hassebrauck, 2008); and (d) its specific use as justification for using backward counting, rather than forward counting, to estimate ovulation and the fertile window (e.g., Oinonen & Mazmanian, 2007; Prokosch et al., 2009; Roney et al., 2011; Rosen & López, 2009; Schwarz & Hassebrauck, 2008). Taken together, it is apparent that researchers in the field regard Fehring et al.'s work as sufficiently reliable to inform their own methodological decisions. We therefore decided it represented a sound source for the present study's cycle simulation parameters.
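To make the generating step concrete, the following is a minimal sketch of how correlated phase lengths could be drawn from a multivariate normal distribution. The means, SDs, and correlations shown are illustrative placeholders, not the Fehring et al. (2006) values summarized in Table 1, and the sequential-phase assumption in the final comment is an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(2014)

# Illustrative placeholder parameters (days) for menses, follicular, and luteal
# phase lengths; the study's actual values come from Fehring et al. (2006).
means = np.array([5.0, 16.5, 12.5])
sds = np.array([1.5, 3.5, 2.0])
corr = np.array([[1.0,  0.1,  0.0],
                 [0.1,  1.0, -0.2],
                 [0.0, -0.2,  1.0]])
cov = np.outer(sds, sds) * corr

def simulate_phase_lengths(n_cycles=20000):
    """Draw correlated phase lengths for each cycle and round to whole days."""
    lengths = rng.multivariate_normal(means, cov, size=n_cycles)
    return np.clip(np.rint(lengths), 1, None).astype(int)

phases = simulate_phase_lengths()
# Assuming the three phases are sequential and non-overlapping, the day of
# ovulation falls at the end of the follicular phase.
ovulation_day = phases[:, 0] + phases[:, 1]
```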

Table 1 Simulation target values and observed simulation values

Shifts in a hypothetical outcome (e.g., attraction to male traits) were then simulated for each cycle following four distinct but similar hypothetical process trajectories (see Fig. 1).

Fig. 1 Examples of Processes A–D

Process A consists of a single stepwise (i.e., on-off) behavioral shift corresponding to a six-day peak fertility window ending on the day of ovulation (e.g., Dunson, Baird, Wilcox, & Weinberg, 1999; Dunson, Colombo, & Baird, 2002; Wilcox, Weinberg, & Baird, 1995, 1998); across the six-day window the degree of behavioral change is equal. Process B is identical to Process A, except that it has a second stepwise behavioral shift during the mid-luteal phase. This mid-luteal shift is meant to represent a secondary shift sometimes observed in ovulatory effects research (e.g., Miller et al., 2007) and thought to reflect dependency on hormonal processes observed across the ovulatory cycle (e.g., Roney & Simmons, 2013). The degree of the secondary shift, relative to the first, was determined by dividing the predicted mid-luteal estrogen-to-progesterone ratio by the average peak-fertility estrogen-to-progesterone ratio, using predicted hormone values reported by Stricker et al. (2006). This results in a mid-luteal secondary shift that is 30 % of the fertile-phase maximum value. Process C is like Process A in that it has a single shift corresponding to the six-day peak in fertility. However, unlike Process A, it has a continuous curvilinear trajectory such that each day's weight for the behavioral score increase is the predicted daily estrogen value divided by the predicted estrogen value for the day of peak fertility, calculated across the six-day window (e.g., Dunson et al., 2002; Wilcox et al., 1995, 1998). Finally, Process D is similar to Process B in that it has two surges in change: a maximum corresponding to the fertile phase and a secondary minor shift corresponding to the mid-luteal phase. Additionally, the weighting of behavioral shifts is based on hormonal ratios across the entire ovulatory cycle, rather than being restricted to the peak fertility and mid-luteal phases. It is unclear what hormonal processes may be associated with shifts in behavior, and whether any single hormonal process is equally predictive across behavior types. To this end, we considered the average trajectory of two hormonal processes: (a) the ratio of each day's predicted estrogen to progesterone level divided by the predicted estrogen to progesterone level for the day of peak fertility; and (b) the predicted daily level of estrogen divided by the predicted level of estrogen on the day of peak fertility. The daily weights generated using these two approaches were averaged together, and the resulting aggregate daily weights were used to model average daily behavioral scores.

For each trajectory we created six conditions of maximum mean value (M max = 0.15, 0.25, 0.5, 1.0, 1.5, and 2.0) for days of peak fertility. Maximum mean values represent the maximum behavioral fluctuation value possible based on the daily weight. For non-fertile days, including mid-luteal days without secondary behavioral shifts,Footnote 1 the mean expected score was zero. Across all participants and days, variability of behavioral scores was held constant (SD = 1.0), with the goal being that within-cycle effect sizes would then be equal to the difference between the sampled high- and low-fertility day behavioral scores (i.e., [High Fertility Score − Low Fertility Score] / 1). This generating procedure resulted in an average within-cycle variability of 1.05 (SD = 0.14, 95 % CI 0.78–1.32) for behavior scores. In addition to generating behavioral shifts, a null behavioral change score was generated in which no change across the cycle occurred (M max = 0.00, SD = 1.0). Due to variability in data generation (i.e., SD = 1.0) and variability in the degree of relative differences between peak fertility and mid-luteal phases (e.g., Processes A & C compared to Processes B & D), observed effect sizes in the population are attenuated relative to the generating M max values (see Table 2). Finally, for between-subjects designs each case had a random mean score (M = 5.5, SD = 1.0) added to each of its daily scores to reflect individual differences in average scores.Footnote 2
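As a concrete illustration, the sketch below generates one cycle of daily scores for Process A under the generating rules just described (stepwise weight of 1 on the six fertile days, mean 0 elsewhere, daily SD = 1.0, and an optional person-level mean for between-subjects designs); the cycle length and day of ovulation are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_process_a(cycle_length, ovulation_day, m_max, sd=1.0,
                       between_subjects=False):
    """Return one cycle of daily scores for Process A (days indexed 0..cycle_length-1)."""
    days = np.arange(cycle_length)
    # Stepwise six-day fertile window ending on the day of ovulation, equal weight
    weights = ((days >= ovulation_day - 5) & (days <= ovulation_day)).astype(float)
    scores = rng.normal(loc=weights * m_max, scale=sd)
    if between_subjects:
        # Random person-level mean (M = 5.5, SD = 1.0) reflecting individual
        # differences in average scores
        scores += rng.normal(5.5, 1.0)
    return scores

cycle = simulate_process_a(cycle_length=28, ovulation_day=14, m_max=0.5)
# Within-cycle effect: sampled high-fertility day minus mid-luteal day (ovulation + 7)
effect = cycle[14] - cycle[21]
```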

Table 2 Observed population level effect sizes by process type, max mean, estimation method, and fertile window size

Estimation of fertility

For each simulated ovulatory cycle the true phase lengths and corresponding ovulatory days (e.g., days −2, −1, 0, 1, and 2) were known, and correspond with days of ovulation predicted using fertility monitors (i.e., Fehring et al., 2006). We estimated the day of ovulation for each simulated cycle using forward counting and confirmed backward counting (i.e., when the date of next menstrual onset is known). For these two methods we assume accuracy of prior and later menstrual onset for forward and backward counting, respectively. We also estimated the day of ovulation using a pseudo backward counting method in which participants had a 50 % chance of accurately predicting next menstrual onset, a 40 % chance of incorrectly indicating next menstrual onset by ±1 day, and a 10 % chance of incorrectly indicating next menstrual onset by ±2 days. This method adds a relatively small error component to the self-report or prediction of next menstrual onset, as is often observed in studies that evaluate discrepancies between expected and observed future menstrual onset dates (e.g., Wilcox, Dunson, & Baird, 2000).
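A minimal sketch of this error component follows, assuming the ±1 and ±2 day error probabilities are split evenly between early and late reports; the 15-day backward offset is an illustrative assumption rather than the study's exact backward-counting rule. The larger error distribution considered later in the Limitations section could be explored by swapping in a different probability vector.

```python
import numpy as np

rng = np.random.default_rng(42)

def reported_next_onset(true_next_onset, n):
    """Add self-report error to the true day of next menstrual onset
    (50 % correct, 40 % off by +/-1 day, 10 % off by +/-2 days)."""
    error = rng.choice([0, -1, 1, -2, 2], size=n,
                       p=[0.50, 0.20, 0.20, 0.05, 0.05])
    return true_next_onset + error

# Pseudo backward counting then counts back from the reported (unconfirmed) onset
reported = reported_next_onset(true_next_onset=29, n=20000)
estimated_ovulation = reported - 15  # illustrative backward-counting offset
```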

Wood et al. (2014) and Gildersleeve et al. (2014a, b) came to different conclusions about the effect of fertile-phase window width on statistical conclusions in the extant literature. By using very precise to less precise windows under controlled conditions, we assess what effect, if any, fertile window width can have on statistical power. Three different peak fertility sampling methods were applied to the ovulatory calendars produced by each fertility estimation method (e.g., forward and pseudo backward counting): (a) the day of peak fertility (cycle day −1); (b) a random day from the six-day fertile window (cycle days −5 through 0); and (c) a random day from a nine-day fertile window (cycle days −8 through 0)Footnote 3 (see Table 2). Thus, we have a degree of precision reflecting the day of greatest expected change, a random day from the entire six-day fertile window, and a random day from a nine-day fertile window to determine what influence fertility window size and estimation method have on power and effect size. To sample a non-fertile period, we selected a day from the mid-luteal phase. Specifically, we used cycle day 7 for each method of estimating ovulation.Footnote 4

Additionally, we applied Wilcox et al.'s (2001) fertility estimation overlays (for all, regularly, and irregularly cycling women) to each cycle, yielding a daily score of expected conception likelihood for each simulated cycle. Consistent with the predominant application of fertility overlays in between-subjects studies, these overlays were treated in a forward-counting manner. That is, the first day of each cycle corresponded to the first day of the overlay, and so on. However, Wilcox et al. (2000) indicate that variability in average cycle length relates to variability in the occurrence of fertile days. This suggests that aligning fertility overlays with other estimates of fertility, rather than applying them as a fixed forward-counting estimate, may improve the correspondence between overlay values and person-specific fertility, as well as fertility-related behavioral shifts. That is, if overlay days of estimated fertility are aligned with estimated fertile days (e.g., using backward counting or LH testing), then the overlay may better correspond with actual fertility status. To test this, we aligned overlays for all simulated cycles such that peak overlay fertility scores corresponded with the estimated day of peak fertility using forward-, backward-, and pseudo backward-counting methods, and true estimation.
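The sketch below illustrates this centering step. The overlay values are hypothetical placeholders rather than Wilcox et al.'s (2001) published day-specific probabilities, and assigning the overlay's minimum value to days outside the shifted overlay is a simplifying assumption of the sketch.

```python
import numpy as np

def centered_overlay(overlay, overlay_peak_index, cycle_length, estimated_peak_day):
    """Shift an overlay so its peak falls on the estimated day of peak fertility."""
    shifted = np.full(cycle_length, overlay.min())
    start = estimated_peak_day - overlay_peak_index
    for i, value in enumerate(overlay):
        day = start + i
        if 0 <= day < cycle_length:
            shifted[day] = value
    return shifted

# Hypothetical conception-likelihood overlay with its peak at index 6
overlay = np.array([0.00, 0.01, 0.03, 0.08, 0.15, 0.25, 0.30, 0.20, 0.05, 0.01])
# Center on a peak fertility day estimated by, e.g., backward counting
daily_fertility = centered_overlay(overlay, overlay_peak_index=6,
                                   cycle_length=28, estimated_peak_day=13)
```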

Procedure for testing effects

For between-subjects t-tests of sample size N (i.e., 10, 20, …, 200), N random participants' peak fertility scores were drawn. Then an independent and random sample of N additional participants' mid-luteal scores was drawn. Mean phase differences (i.e., peak fertility compared to mid-luteal), pooled SD, effect size (Cohen's d), t-value, and p-value were calculated, and we then determined whether the correct statistical decision had been made. This process was repeated across all behavioral shift trajectories, ovulation estimation methods, and fertile window-width conditions. The within-subjects t-test procedure was identical to the between-subjects procedure, except that N random participants' peak fertility scores were drawn and matched to the same participants' corresponding mid-luteal scores.Footnote 5 The procedure for testing between-subjects correlations using fertility overlays consisted of sampling a random cycle day from N randomly selected cycles. Cycle-day behavior scores were matched with corresponding fertility overlay scores. The resulting samples were tested using the Pearson correlation, where r(N − 2) served as the effect size. Similar to the t-test procedure, we concluded by determining whether the correct statistical decision was made for each test of correlation.

Evaluating power using simulated data

For estimates of power using between- and within-subjects t-tests, a correct statistical decision was defined using two criteria: (a) mean differences were in the same direction as the population (i.e., M HighFertility > M LowFertility); and (b) the effect was statistically significant using a one-tailed test (α = .05). All other results were scored as incorrect decisions. Power was defined as the number of correct decisions divided by the total number of tests.Footnote 6 Power estimates using correlations of fertility overlays were defined using the same criteria as the t-tests, with the exceptions that criterion (a) refers to correlation direction (i.e., positively related) rather than mean differences and criterion (b) uses two-tailed tests (i.e., α = .05 split above and below the estimated correlation value).Footnote 7
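The following is a minimal sketch of this repeated-sampling power procedure for the t-test conditions, assuming arrays `peak` and `luteal` hold each simulated cycle's sampled high-fertility and mid-luteal scores (toy values are generated at the bottom purely for illustration); the repetition count and function name are assumptions of the sketch, not the study's exact implementation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def estimate_power(peak, luteal, n, n_reps=1000, within=True, alpha=0.05):
    """Share of repeated samples yielding a correct decision: a mean difference
    in the population direction and a significant one-tailed t-test."""
    correct = 0
    for _ in range(n_reps):
        if within:
            idx = rng.choice(len(peak), size=n, replace=False)
            high, low = peak[idx], luteal[idx]            # matched scores
            res = stats.ttest_rel(high, low)
        else:
            high = peak[rng.choice(len(peak), size=n, replace=False)]
            low = luteal[rng.choice(len(luteal), size=n, replace=False)]
            res = stats.ttest_ind(high, low)
        # Convert the two-tailed p-value to a one-tailed test of high > low
        p_one = res.pvalue / 2 if res.statistic > 0 else 1 - res.pvalue / 2
        if high.mean() > low.mean() and p_one < alpha:
            correct += 1
    return correct / n_reps

# Toy scores standing in for sampled high-fertility and mid-luteal days
peak = rng.normal(0.5, 1.0, 20000)
luteal = rng.normal(0.0, 1.0, 20000)
print(estimate_power(peak, luteal, n=50, within=True))
```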

Empirical data

A publicly available dataset was used in the present study (Fehring, Schneider, Raviele, Rodriguez, & Pruszynski, 2013). Cycle inclusion from the public data set was contingent upon whether participants had been assigned to monitor fertility status using Clear Blue Easy Fertility Monitors (CBFM) and whether there was sufficient ovulatory cycle event data to characterize each cycle; i.e., the CBFM indicated a day of peak fertility and an estimated day of ovulation, and both cycle onset and offset were recorded. These criteria resulted in the inclusion of 91 women, each contributing 1–43 ovulatory cycles (N cycles = 906, M cycles per woman = 9.98, SD cycles per woman = 7.37). For a full description of the study we refer readers to the original study and data hosting site (Fehring et al., 2013; e-Publications@Marquette, 2012).

Estimation of fertility

Estimates of ovulation based on CBFM results were used to determine six-day periods of peak fertility (e.g., Dunson et al., 1999, 2002; Wilcox et al., 1995, 1998), which then served as the standard against which forward and confirmed backward counting methods were compared.Footnote 8 These estimation methods were compared with the CBFM by determining the frequency with which each method corresponded with CBFM-indicated days of peak fertility. Comparisons considered both capture of the day of expected peak fertility and capture of any of the six days of peak fertility, and were made for each estimation method using the predicted day of peak fertility, a six-day window of predicted peak fertility, and a nine-day window of peak fertility.

Results

Predicting power

A total of 11,520 power estimates were generated for between-subjects (n = 5,760, M = .44, SD = .36) and within-subjects (n = 5,760, M = .51, SD = .37) t-tests. A t-test indicated that, in general, within-subjects designs had greater power than between-subjects designs to detect fertility-based differences (t(11,518) = 11.17, p < .001, d = .21).

To test and control for other simulation conditions, a factorial ANCOVA model was fit using type II sums of squares and the following additional predictors: (a) six-day and nine-day windows compared against the day of predicted peak fertility; (b) pseudo backward counting, confirmed backward counting, forward counting, and true estimation; (c) sample size; (d) a four-way interaction of within- versus between-subjects designs, fertility estimation method, window size, and sample size, as well as all three-way and two-way interactions of these terms; and (e) average effect size as a control. This model (see Table 3) explained a significant proportion of variance in power (F(48, 11,471) = 710.1, p < .0001, R² = .748). It is important to note that in this model the increase in power when using within-subjects tests (B = .10, SE = .02, p < .001) remained consistent with our initial t-test results after accounting for the other simulation conditions.

Table 3 Factorial ANCOVA table predicting power to detect ovulatory effects

Dummy-coded contrasts were used to investigate the interaction between fertility estimation method and within- versus between-subjects designs. Results indicate that the interaction was driven by a reduction in within-subjects designs’ power, relative to between-subjects designs, when comparing forward counting with both pseudo backward- and backward-counting fertility estimation methods (B = −.07 and −.07, SE = .035 and .35, p = .046 and .047, respectively)—also note that within-subjects designs were still more powerful than between-subjects designs, despite this reduction. This suggests that the benefit of using within-subjects designs is reduced when coupled with forward-counting estimation, but this effect was only significant when comparing forward- and backward-counting methods.

To understand the interaction between fertility estimation method and fertility window size, the data were subset by fertility estimation type and a one-way ANOVA was conducted for each subset in which power was predicted by window size. Results indicate that fertility window size was a significant predictor of power when using forward (F(2, 2,877) = 16.26, p < .001, R² = .01), pseudo backward (F(2, 2,877) = 4.87, p < .01, R² = .003), backward (F(2, 2,877) = 4.76, p < .01, R² = .003), and true ovulation estimation methods (F(2, 2,877) = 75.01, p < .001, R² = .05). It is evident that the effect of window size was larger for true ovulation estimation than for the other estimation procedures. Pairwise comparisons using the Bonferroni correction tested whether there were specific differences in the pattern of window effects within each ovulation estimation method. Results indicated several key conclusions: (a) one-day estimates of peak fertility yielded significantly more power than either six-day or nine-day windows; (b) there was a significant difference in power between six-day (M = .63, SD = .37) and nine-day (M = .50, SD = .37) windows for true ovulation estimation (t(1,918) = 7.71, p < .001), but not for any of the other ovulation estimation methods; and (c) effect sizes were larger for the true ovulation estimation condition relative to all other estimation methods, and for forward counting estimation relative to pseudo backward and backward counting (see Table 4).

Table 4 Effect sizes comparing power between window size conditions within fertility estimation methods

Investigation of the interaction between sample size and fertility estimation method was completed using dummy-coded estimation variables. The interaction effect was driven by differences in the rate of change in power as a function of sample size for pseudo backward counting (B = .007, SE = .002, p < .001), backward counting (B = .007, SE = .002, p < .001), and true ovulation (B = .004, SE = .002, p = .04) relative to forward counting. These results indicate that power increases faster for the pseudo backward, backward, and true ovulation methods than for forward counting. For example, with sample sizes of 20, 50, and 100, pseudo backward and backward counting (additional power = .14, .35, and .70, respectively) and true ovulation (additional power = .08, .20, and .40, respectively) would have more power relative to forward estimation.

Comparing estimation methods

To understand differences in power between estimation methods, we evaluated the frequency with which peak fertility days, defined by true ovulation, were identified by the different counting-based fertility estimation methods when applied to both the simulated population data and the Fehring et al. (2013) data (see Table 5). Comparisons were made with respect to correct and incorrect selection of the day of peak fertility, and then of any of the six fertile days in a cycle. Correct percentages were based on how many target (i.e., true fertile) days were captured divided by the total possible (e.g., 20,000 possible days of peak fertility in the simulated data). Incorrect percentages were based on how many non-target days were captured divided by the total number of captured days (e.g., 180,000 total days when using a nine-day window). To evaluate efficacy, the one-day predicted peak fertility day and both the six- and nine-day windows of peak fertility were used.
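Because the two percentages use different denominators, a small sketch may help; the day ranges below are arbitrary examples, and in the study the counts are aggregated over all cycles rather than computed per cycle.

```python
def capture_rates(captured_days, target_days):
    """Correct % = captured target days / total target days;
    incorrect % = captured non-target days / total captured days."""
    captured, target = set(captured_days), set(target_days)
    correct_pct = 100 * len(captured & target) / len(target)
    incorrect_pct = 100 * len(captured - target) / len(captured)
    return correct_pct, incorrect_pct

# e.g., a nine-day window (cycle days 10-18) against six true fertile days (15-20)
correct, incorrect = capture_rates(captured_days=range(10, 19),
                                   target_days=range(15, 21))
# correct ~ 66.7 % (4 of 6 fertile days), incorrect ~ 55.6 % (5 of 9 captured days)
```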

Table 5 Percent of true fertile days captured using different fertility windows and estimation methods

Generally, accuracy at predicting the true day of peak fertility was very poor when using the one-day and six-day windows. While inclusion of the true day of peak fertility did increase considerably using the six-day window, these estimates were accompanied by a comparable proportion of incorrectly identified days (compared with the one-day window). However, it should be noted that when considering the total number of true fertile days (six days total), the error rates drop by 30–50 %; error declined most for pseudo backward and backward counting. Unsurprisingly, moving to a nine-day window necessarily caused both correct target identification and non-target identification percentages to increase relative to the six-day window. Forward counting consistently underperformed compared to both backward methods, and pseudo backward slightly underperformed relative to backward counting.

Efficacy of fertility estimation overlays

Generally, the fertility overlay for all participants (i.e., both regularly and irregularly cycling participants) yielded the largest effect sizes, but these gains were minor relative to the overlay for regularly cycling participants (see Table 6). Nonetheless, we focus our interpretation on the overlay that had the best ability to detect the effect of fertility (i.e., the overlay for all participants). Across all processes, power exceeded 80 % only for the greatest maximum (max) mean condition (M max = 2.0) for Processes A and B (8.3 % of conditions; see Fig. 2). Even at a reduced power threshold of 50 %, only seven (29.2 %) of the considered conditions (i.e., M max = 2.0 for Processes A–D, M max = 1.5 for Processes A and B, and M max = 1.0 for Process A; see Fig. 3) met or exceeded the threshold. This means that for 70.8 % of the conditions considered, detection of the true effect would be less likely than a coin toss, even with a sample of 200 participants.

Table 6 Average Pearson correlation values by maximum difference, process type, and fertility overlay
Fig. 2 Power trajectories for Processes A–B using Wilcox et al.'s (2001) fertility overlay for all participants across sample size. Power trajectories from the bottom to the top correspond with M max values of 0.15, 0.25, 0.5, 1.0, 1.5, and 2.0

Fig. 3 Power trajectories for Processes C–D using Wilcox et al.'s (2001) fertility overlay for all participants across sample size. Power trajectories from the bottom to the top correspond with M max values of 0.15, 0.25, 0.5, 1.0, 1.5, and 2.0

This can be better understood by considering the average effects observed within individuals. A second set of analyses considered the relation of fertility overlay values to simulated behavioral shifts for each overlay type, effect size, and process type (see Table 7). Average cycle effects and power were relatively consistent across overlay types, but as with the between-cycle results, effect size and power diminished with process complexity. Even when power was greatest (i.e., Process A), the effect was not successfully detected in just over 50 % of the cycles; this failure rate rose to 78–80 % for Processes C and D.

Table 7 Average within-cycle correlation values and power to detect effects using fertility overlays

As previously mentioned, it may be beneficial to consider a cycle-centered approach to overlay use. That is, if individual cycle-estimated fertility is used to center overlays, then overlay efficacy (e.g., power) may increase. This approach was tested with the overlay for both regular and irregular cycles, using forward, pseudo backward, backward, and true ovulation estimation to center the overlays. Improvements in power across estimation types are striking. For Processes A and B, M max values of 2.0 and 1.5 exceed 80 % power with a sample size of 200 for pseudo backward, backward, and true ovulation estimation. Given an M max value of 2.0, forward estimation also exceeds 80 % with a sample size of 200 for Processes A and B (see Figs. 4, 5, 6, and 7).

Fig. 4 Power trajectories for Processes A–B using Wilcox et al.'s (2001) fertility overlay for all participants centered using forward counting estimated peak fertility. Power trajectories from the bottom to the top correspond with M max values of 0.15, 0.25, 0.5, 1.0, 1.5, and 2.0

Fig. 5 Power trajectories for Processes A–B using Wilcox et al.'s (2001) fertility overlay for all participants centered using pseudo backward counting estimated peak fertility. Power trajectories from the bottom to the top correspond with M max values of 0.15, 0.25, 0.5, 1.0, 1.5, and 2.0

Fig. 6 Power trajectories for Processes A–B using Wilcox et al.'s (2001) fertility overlay for all participants centered using backward counting estimated peak fertility. Power trajectories from the bottom to the top correspond with M max values of 0.15, 0.25, 0.5, 1.0, 1.5, and 2.0

Fig. 7 Power trajectories for Processes A–B using Wilcox et al.'s (2001) fertility overlay for all participants centered using true peak fertility. Power trajectories from the bottom to the top correspond with M max values of 0.15, 0.25, 0.5, 1.0, 1.5, and 2.0

Similarly, for Processes C and D, the increased power now exceeds the 80 % threshold for the M max value of 2.0 when using both backward counting and true ovulation (see Figs. 8, 9, 10, and 11). Further, 45.8 % of true ovulation conditions, 41.7 % of pseudo backward and backward conditions, and 25 % of forward conditions now exceed 50 % power given a sample size of 200. While these represent marked improvements in power, the proportion of conditions that can achieve 80 % power given a maximum sample size of 200 is, at best, almost 50 %, and those conditions are concentrated among the largest effect and sample sizes considered.

Fig. 8 Power trajectories for Processes C–D using Wilcox et al.'s (2001) fertility overlay for all participants centered using forward counting estimated peak fertility. Power trajectories from the bottom to the top correspond with M max values of 0.15, 0.25, 0.5, 1.0, 1.5, and 2.0

Fig. 9 Power trajectories for Processes C–D using Wilcox et al.'s (2001) fertility overlay for all participants centered using pseudo backward counting estimated peak fertility. Power trajectories from the bottom to the top correspond with M max values of 0.15, 0.25, 0.5, 1.0, 1.5, and 2.0

Fig. 10 Power trajectories for Processes C–D using Wilcox et al.'s (2001) fertility overlay for all participants centered using backward counting estimated peak fertility. Power trajectories from the bottom to the top correspond with M max values of 0.15, 0.25, 0.5, 1.0, 1.5, and 2.0

Fig. 11 Power trajectories for Processes C–D using Wilcox et al.'s (2001) fertility overlay for all participants centered using true peak fertility. Power trajectories from the bottom to the top correspond with M max values of 0.15, 0.25, 0.5, 1.0, 1.5, and 2.0

Discussion

Summary

In the present study we used a simulated population of ovulatory cycles and an empirical data set to demonstrate significant discrepancies between fertility overlays and counting methods, on the one hand, and more accurate hormonal measures of ovulation detection, on the other (e.g., Bullivant et al., 2004; Fehring et al., 2013; Lloyd & Coulam, 1989; Tanabe et al., 2001; Trussel, 2008). Furthermore, we demonstrated that between-subjects methods (i.e., independent-samples t-tests and fertility overlays) showed significantly less power to detect effects present in the population when compared to within-subjects approaches.

Power disparities between counting methods and true ovulation, with respect to detection of ovulatory effects, are likely due to differences in the days identified as fertile. As error rates in fertile day identification increased, estimates of mean differences tended to be smaller than those observed at the population level. This was particularly true for the curvilinear Processes C and D because their apex corresponds with the population-level maximum change in simulated behavior. In contrast, because Processes A and B had stable differences across all fertile days, errors in fertile day identification did not diminish estimates of population mean differences to the same degree. As a result, while pseudo backward and backward estimation were underpowered compared to true ovulation, these differences were not as large, across most sample sizes and maximum difference values, as the discrepancies observed when using forward counting.

Differences in fertile day estimation using forward and backward counting methods compared with hormonal detection were replicated in our empirical example. We observed that both forward and backward counting methods differed in the days identified as fertile when compared to hormonal detection. These differences were consistent with those from our simulated data set, which was derived from data describing a different empirical sample (i.e., Fehring et al., 2006). Taken together, this lends support to the veracity of our simulated data, and conclusions drawn from it with respect to power estimates.

Implications

Several implications can be drawn from these findings; one of the most evident is that within-subjects t-tests are preferable to between-subjects t-tests when studying ovulatory effects. This is likely due to the inability to separate between-subjects variability in scores from between-phase (high fertility vs. low fertility) differences when using between-subjects designs. While these conclusions do not contradict any typically expressed expectations, what is interesting is that, with an average predicted increase in power of .1007 when using a within-subjects design and a predicted increase in power of .0019 for each participant sampled, we would predict needing approximately 53 additional participants (.1007 / .0019 ≈ 53) to achieve comparable power using a between-subjects design with hormonal, backward counting, or pseudo backward counting estimation (assuming self-report error in pseudo backward counting was not worse than that used in our simulation study). It should be noted that, due to the significant interaction between fertility estimation method and sample size, it would take an additional 30 participants (beyond the initial 53) to achieve equal power for a between-subjects design using forward counting estimation.

Another implication is that, whereas backward and pseudo backward counting estimation outperformed forward counting estimation of ovulation, they appear to be inferior to hormonal assessments. While it may be argued that the inferiority of backward estimation compared with other assessment methods is already known within the field (e.g., Bullivant et al., 2004; DeBruine et al., 2010), the present state of the supporting literature suggests these methods remain very popular (particularly pseudo backward counting; e.g., Gildersleeve et al., 2014a, b; Harris, 2013). Although it is true that larger and larger sample sizes would diminish power differences due to estimation method, the average sample size employed in a sample of the extant literature is 95 participants, and the distribution is heavily skewed (Mdn = 50, SE = 37.4; Gildersleeve et al., 2014a, b). Therefore, while larger sample sizes could diminish concerns over estimation effects on power, studies employing more typical sampling procedures should take these effects into account during study design.

An important issue that has been raised is whether fertility window size may influence the likelihood of detecting effects. Our simulated results suggest that fertility window size does have several effects. Most consistently, using the single day of estimated peak fertility (i.e., cycle day −1) resulted in greater power to detect effects than using days sampled from either a six- or nine-day window, regardless of estimation method. Also, while differences between six- and nine-day windows were not detected using any of the counting methods, a difference was detected when using true ovulation. This is likely due to the ratio of target days identified to non-target days identified. That is, while the six-day window identifies slightly fewer target days than the nine-day window, it also identifies fewer non-target days. These ratios appear to average out, suggesting that six-day and nine-day windows will, generally, yield similar results when using counting methods. This, however, has less to do with increased accuracy than with the averaging out of error-to-accuracy ratios in identifying fertile days.

Finally, our simulated results suggest that fertility overlays, though popular, perform very poorly with respect to both effect size and power. Based on these results, we have to disagree with recommendations that researchers use fertility overlays as a method for estimating fertility (Gildersleeve et al., 2014a). While using overlays may help alleviate concerns about abuse of researcher degrees of freedom due to inconsistent definitions of fertility and fertility window size (e.g., Harris, 2013), such an approach may substantially impair power to detect effects. However, if the research design imposes limitations preventing other methods of assessment (e.g., LH testing), we strongly recommend that researchers maximize potential power by at least following up with participants regarding onset of menstruation, so that backward estimation can be used to center overlay fertility values on participants' observed data, or so that researchers can forgo fertility overlays in favor of more accurate identification of high and low fertility days.

Limitations and concerns

One possible limitation of our study is that we simulated ovulatory cycles using descriptive statistics and covariances from a study that used hormonal assessment methods to determine occurrence of ovulation. As a result, any error inherent in that estimation method may have been passed on to our simulated data. However, there is substantial evidence suggesting that hormonal estimation of ovulation is highly accurate (Behre et al., 2000; Lloyd & Coulam, 1989; Tanabe et al., 2001). This suggests that estimation inaccuracy was likely low in the empirical data informing this study (Fehring et al., 2006; Fehring et al., 2013), and that hormone-based estimates of fertility are a reasonable proxy of true ovulation for comparison with counting methods.

Another possible limitation is the process trajectories considered in the present study. These were selected for two principal reasons. One, they demonstrate processes that are either solely dependent on fertility (Processes A and C), or that are associated with co-occurring processes across the ovulatory cycle such as hormone level fluctuations (Processes B and D; e.g., Lukaszewski & Roney, 2009; Roney & Simmons, 2008). Two, they demonstrate varying degrees of complexity ranging from a simple, constant, step-wise effect during periods where change occurs (Processes A and B) to a curvilinear trajectory (Processes C and D) similar to those in reports of ovulatory effects on behavior (e.g., Miller et al., 2007; Roney & Simmons, 2008). While we cannot simply assume that any of these simulated trajectories are perfectly representative of specific behavioral processes, they do demonstrate how representative processes varying in complexity affect power to detect ovulatory effects, and can help inform researchers as they develop future research designs.

One of the estimation methods was pseudo backward counting. Pseudo backward counting is commonly used in lieu of true backward estimation. Rather than confirming menstrual onset and counting backwards, researchers use a participant's estimate of their next menstrual onset and count back from that date. The problem, as previously discussed, is that many factors may cause a participant to be inaccurate in estimating menstrual onset, even if they consider themselves to be regularly cycling (e.g., Wilcox et al., 2000). However, this problem has not been studied as thoroughly as others. As a result, we erred on the side of caution in developing our probability of participant error in predicting next menstrual onset, such that error rates were approximately normally distributed around the correct date, with error being no more extreme than ±2 days. The rationale was to remain as consistent as possible with the assumption that participants' cycles are regular (though this is rarely confirmed). Using this relatively small rate of error, pseudo backward and backward estimation resulted in similar efficacy.

However, if the error rate were increased, in terms of both the frequency and the degree of error, it is easy to demonstrate how pseudo backward estimation could be as ineffective as, if not worse than, forward counting estimation. For example, if we modify the simulation procedure used to generate the previously reported pseudo backward days to have a 30 % chance of correct prediction, a 22.2 % chance of ±1 day, a 17.8 % chance of ±2 days, a 13.3 % chance of ±3 days, an 8.8 % chance of ±4 days, and a 4.4 % chance of ±5 days, then rates of correct and incorrect identification are comparable to, if not a little worse than, those of forward counting (see Table 8). Similarly, when using empirically demonstrated prediction error proportions based on average cycle length (i.e., Creinin et al., 2004), pseudo backward counting performs on par with forward counting. Clearly, pseudo backward estimation is heavily dependent on the accuracy of menstrual onset prediction. As a consequence, pseudo backward estimation should not be assumed to be as effective as backward counting, given that without menstrual onset confirmation there is no way to be certain that the correct date to count back from was identified. Further, while backward counting appears to be more accurate than forward counting because the luteal phase is more stable than the follicular phase, it is important to recall that it is still variable.

Table 8 Accuracy of pseudo backward estimation using alternative error probabilities in simulation

Another potential limitation is that, for t-tests, while we treated several fertile windows as tenable for sampling high fertility days, we only sampled from the same luteal phase day (ovulatory calendar day 7 or day 21), or its closest neighbor. However, like high fertility day sampling, low fertility day sampling methodology is quite variable, ranging from a target day in the luteal phase (e.g., Cárdenas & Harris, 2007), to a range of days or all days in the luteal phase (e.g., Caryl et al., 2009; Moore et al., 2011; Morrison, Clark, Gralewski, Campbell, & Penton-Voak, 2010; Pawlowski & Jasienska, 2005), to all days outside the high fertility window (e.g., Little, Jones, & Burriss, 2007; Little, Jones, & DeBruine, 2008; Penton-Voak & Perrett, 2000; Puts, 2005; Vaughn, Bradley, Byrd-Craven, & Kennison, 2010). In terms of error in identifying high fertility days as low fertility days, these methods did not differ substantially in the present study (see Table 9), and we have focused on reporting results using the luteal sampling method that yielded the lowest overall error rate.

Table 9 Percent of errors made using different low fertility sampling methods

It is worth noting, though perhaps self-evident from the trajectories that were considered, that selecting different low fertility sampling methods would have little impact on effect sizes when sampling from processes that only generated a behavioral shift during high fertility (i.e., Processes A and C). However, when the process has a secondary shift during the luteal phase (i.e., Processes B and D), sampling low fertility days only from the luteal phase would reduce the average difference in the target behavior between high and low fertility phases (see Table 10). If a process has a consistent relation to hormonal levels (i.e., Process D), then regardless of where low fertility days are sampled from, the average difference between low and high fertility days may be smaller. Therefore, how researchers choose to sample both high and low fertility days can impact their ability to detect behavioral shifts between these phases. Problematically, little is known about behavioral trajectories in relation to phases of the cycle, so it may be difficult for researchers to determine a priori what the best sampling approach is.

Table 10 Average behavioral scores by ovulatory cycle day

Future directions

Within-subjects t-tests performed better than between-subjects t-tests and overlay correlations, but this represents a very minimal approach to studying ovulatory effects. An advantage of sampling more frequently from each participant's ovulatory cycle is that it would allow researchers to assess between-phase effects using multilevel modeling (e.g., Miller et al., 2007; Prokosch et al., 2009; Roney & Simmons, 2013). This modeling approach has several advantages, such as allowing for the evaluation of multiple time scales concurrently (e.g., participant age, ovulatory calendar day, and day of the week) while controlling for within-subject variability. Additionally, with more observations across the cycle, estimates of average participant-specific scores and fluctuations would be more accurate, and would allow for more complex partitioning of score variance.

Moreover, increased frequency in sampling would help researchers to actually describe the trajectories of target behaviors across the cycle—better informing theory about ovulatory shifts of target behaviors and best methodological practices to sample them. As sampling frequency from each participant increases, the likelihood of sampling from distinct ovulatory phases for each participant also increases. Consequently, researchers could describe these processes across the ovulatory cycle (e.g., using linear growth models; McArdle, 2009), and investigate how distinct ovulatory phases (e.g., fertile days) or hormonal correlates (e.g., estrogen and progesterone) are associated with these behavioral trajectories. In this way, the study of ovulatory effects could transition from an investigation of one to two days sampled from a 22- to 36-day cycle to the study of the cycle itself.
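As an illustration of the kind of model this sampling scheme would support, the sketch below fits a mixed-effects model with a random intercept per participant; the toy data, column names, and the simple fixed effects of fertility status and cycle day are assumptions of the sketch, not a model proposed in this paper.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Toy data: 40 participants, 10 sampled days each across a 28-day cycle
rows = []
for pid in range(40):
    person_mean = rng.normal(5.5, 1.0)
    for day in rng.choice(28, size=10, replace=False):
        fertile = int(9 <= day <= 14)            # hypothetical six-day fertile window
        behavior = person_mean + 0.5 * fertile + rng.normal(0, 1)
        rows.append({"participant": pid, "cycle_day": int(day),
                     "fertile": fertile, "behavior": behavior})
df = pd.DataFrame(rows)

# Random intercept for participant; fixed effects of fertility status and cycle day
model = smf.mixedlm("behavior ~ fertile + cycle_day", data=df,
                    groups=df["participant"])
print(model.fit().summary())
```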

Finally, an inherent limitation of one- to two-day sampling from the cycle is the possibility of under-representing periods of interest in the cycle. While counting estimation methods can reduce the likelihood of this, they are not foolproof. In fact, we found that the easiest and most common counting method, forward counting, is less than half as effective at sampling from fertile days as taking a random sample, and both backward methods, while more effective than forward counting, were still only about three-quarters as effective as taking a random sample (i.e., for a one-day peak). Further, true backward counting can only be applied after sampling has occurred. The more typically used pseudo backward counting method relies on an estimate of next menstrual onset that necessitates assumptions of cycle length regularity and participant accuracy, assumptions that have been demonstrated to be untenable for several reasons (e.g., Creinin et al., 2004; Fehring et al., 2006; Small et al., 2007; Wilcox et al., 2000). By increasing sampling frequency, at a minimum, true backward estimation or LH testing could be incorporated, and the risk of failing to sample from critical phases of the ovulatory cycle would be reduced.

Conclusion

The results of our study cannot speak to the validity of the ovulatory shift hypothesis, or to the broader field of research on ovulatory effects in humans. Nor does our study raise a novel concern over the potential inadequacy of methodology popularly employed in this area of research. What our study does is make explicit the degree of the problem that popular methods can create. With these methodological limitations made clear, debate over best practices on these topics can rest on more than conjecture, and the field can adopt some minimal methodological guidelines to maximize power while mitigating the risk of spurious findings as it moves forward.