Background

Infertility affects approximately 10–15% of couples in the United States [1]. Utilization of infertility treatments, such as assisted reproductive technology (ART), has increased in the past three decades [2,3,4]. As ART usage increases, so does interest in understanding how women’s infertility and treatment history affect long-term health outcomes. Previous research suggests that women who experience infertility, subfertility, or reduced parity and women who utilize fertility treatments may have increased risk of certain chronic diseases [5,6,7,8,9,10]. To assess infertility history in epidemiologic studies, accurate and feasible measures of infertility and fertility treatment history are required.

A recent systematic review of ART-based validation studies indicated a lack of rigorous publications on the validation of routinely collected data from fertility populations [11]. While medical records are often the “gold standard” to collect information, utilizing medical records may not always be feasible, particularly for epidemiologic studies that have a large sample size or are population-based. Moreover, information on lifestyle factors (e.g. smoking history, diet, physical activity) that may serve as potential confounding [12] or mediating variables [13] may be absent from medical records, inconsistently documented, or inaccurately recalled. Self-reported measures are widely utilized in epidemiologic research and are often considered more cost-effective. However, there is insufficient research on the accuracy of self-reported measures of infertility, especially over an extended period of follow-up. Understanding recall after an extended follow-up period is especially important for research related to chronic health conditions that may have a significant lag between exposure and disease onset. Prior research on the validity of recalled infertility history and fertility treatment has been limited in duration of follow-up, with prior studies ranging from several months to a few years [14,15,16,17,18,19]. In research that followed some participants for a longer duration of time (maximum 17 years), only a minority of participants (< 20%) were followed for 8 or more years [20]. To overcome these previous limitations, our study evaluated women’s recall of infertility and treatment history approximately 20 years after treatment initiation and compared self-reported measures captured in 2018 to medical records and self-reported data collected at prior study initiation (1994–2003).

Methods

The details of recruitment and participation in the original IVF Study (IVF study) have been described previously [21, 22]. Briefly, from 1994–2003 and 1999–2003, 2688 couples newly enrolled in in vitro fertilization (IVF) treatments were recruited from three IVF clinics near Boston, Massachusetts. At enrollment, medical history and lifestyle factors were obtained via a self-administered questionnaire prior to treatment. IVF treatment and outcome data were abstracted for up to six cycles from clinical records. This study was approved by the Institutional Review Board of Brigham and Women’s Hospital in Boston, Massachusetts.

In 2018, 15–23 years after enrollment in the IVF Study, women were recontacted and asked to participate in the AfteR Treatment Follow-up Study (ART-FS). An initial recontact letter was mailed to each woman using her most recent address in the Mass General Brigham (formerly Partners Health Care) electronic health record system, used by two of the largest healthcare providers in Massachusetts. If no address was available, the address from the IVF Study record was used. Study data were collected and managed using Research Electronic Data Capture (REDCap) tools hosted by Brigham and Women’s Hospital [23, 24]. REDCap is a HIPAA-compliant, secure, web-based software platform designed to support data capture for research studies. Women were directed to use a provided REDCap study link to complete the survey online. Participants had the option to return a pre-paid postcard to request a paper copy of the questionnaire. If a woman did not reply to the initial letter, an additional letter was sent 2–3 weeks later. If either the initial or subsequent letter was returned due to an incorrect address, we searched for the participant’s address using an online search engine (https://premium.whitepages.com/, accessed April – June 2018) using exact matches to names and birth dates. Recontacted women were eligible to participate in the ART-FS. Those who completed the questionnaire, constituting consent, were included in analyses.

Data collection

Medical history and lifestyle factors

Medical history and lifestyle factors were obtained from self-administered questionnaires collected between 1994–2003 during the IVF Study. Information was collected on a variety of domains, including age, race/ethnicity, religion, marital status, highest level of completed education, cigarette smoking history, depression history, reproductive history, gravidity, occupational and environmental exposures, and previous pregnancy outcomes (therapeutic abortion, miscarriage or stillbirth, ectopic (tubal) pregnancy, liveborn pregnancy, molar pregnancy).

Fertility treatment history

Information on fertility treatment history was collected from three sources: i) the IVF Study clinical records, ii) the IVF Study self-reported questionnaire, and iii) the ART-FS questionnaire (Fig. 1). We compared treatment recall across two periods of time: i) prior to the IVF Study enrollment and ii) during the IVF Study. The IVF Study questionnaire was completed during study enrollment, prior to start of IVF treatments. The questionnaire asked about ever use of fertility treatments prior to IVF Study enrollment. Women were asked: “Have you previously received IVF or GIFT?” and “Have you previously received clomid or pergonal to stimulate your ovaries?” To collect information on fertility treatment history during IVF Study enrollment, we utilized clinical records on the number of cycles of fresh or frozen embryo transfer IVF each woman received.

Fig. 1 Description of data sources from AfteR Treatment Follow-up Study and IVF Study

On the ART-FS questionnaire, women were asked about their treatment across three time points: i) prior to IVF Study enrollment, ii) during the IVF Study, and iii) after the IVF Study. Time periods i) and ii) are compared in this analysis. Women were asked, “How many cycles of the following types of fertility treatments did you undergo before you began the [IVF Study] in [start month and year]?” with the following response options: Clomid, Gonadotropin injections, fresh embryo transfer IVF, and frozen embryo transfer IVF (range from 0 to 7+ cycles). Women were also asked to recall their fertility treatment during the IVF Study (“How many cycles of the following types of fertility treatments did you undergo between [start of IVF Study participation month and year] and [end of IVF Study participation month and year]?”). Women could report the number of IVF cycles separately for fresh and frozen embryo transfers (range 0 to 7+).

Infertility diagnoses

In the IVF Study, infertility diagnoses were collected from two sources: i) clinical records and ii) self-reported questionnaires. On the IVF Study questionnaire, women were asked: “What is your understanding of the cause(s) of your fertility problem?” and could self-report (Yes or No) multiple infertility problems: blocked or absent tubes, cervical problems, Diethylstilbestrol exposure, a double or divided uterus, endometriosis, male factor (low sperm count, etc), fibroids, polycystic ovaries, and other with no indication of priority (primary, secondary, etc). Given the structure of the questionnaire, infertility problems with missing responses were assumed to indicate the absence of that condition if at least one other infertility problem was indicated. On the IVF Study questionnaire, a woman reporting “fibroids” was categorized as having “Uterine factor infertility” and “blocked or absent tubes” was categorized as having “Tubal factor infertility”. If she reported a write-in response that included “perimenopausal”, “age” or “premature ovarian failure” she was categorized as having “Diminished ovarian reserve/Increased maternal age”. In clinical records, codes belonging to diagnostic groups were reviewed and categorized to align with the infertility diagnosis categories that were defined for the analysis (PCOS, Endometriosis, Uterine factor infertility, Tubal factor infertility, Diminished ovarian reserve/Increased maternal age, Male factor infertility, Other/unknown).
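
The questionnaire coding rules described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the function names, the dictionary layout, and the case-insensitive substring match on write-in text are assumptions, and only the mappings stated explicitly in the text are encoded.

```python
# Hypothetical sketch of the IVF Study questionnaire coding rules.
# Only the mappings named in the text are included; other items pass through unchanged.
ITEM_TO_CATEGORY = {
    "fibroids": "Uterine factor infertility",
    "blocked or absent tubes": "Tubal factor infertility",
}
WRITE_IN_KEYWORDS = ("perimenopausal", "age", "premature ovarian failure")
DOR_CATEGORY = "Diminished ovarian reserve/Increased maternal age"


def fill_missing_as_absent(responses):
    """responses maps item name -> True, False, or None (missing).

    Missing items are treated as absent only when at least one other
    infertility problem was indicated; otherwise they stay missing.
    """
    if any(value is True for value in responses.values()):
        return {item: bool(value) for item, value in responses.items()}
    return dict(responses)


def categorize(responses, write_in=""):
    """Return the set of analysis categories implied by one questionnaire."""
    cleaned = fill_missing_as_absent(responses)
    categories = {
        ITEM_TO_CATEGORY.get(item, item)
        for item, value in cleaned.items()
        if value
    }
    # Assumption: a crude case-insensitive substring match on the write-in text.
    text = write_in.lower()
    if any(keyword in text for keyword in WRITE_IN_KEYWORDS):
        categories.add(DOR_CATEGORY)
    return categories
```

A real implementation would likely need manual review of write-in responses, since a substring match on "age" will also match unrelated words.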

On the ART-FS questionnaire, women were asked “What do you remember as being the primary reason for why you utilized infertility treatments in the IVF Study starting in [start month and year]?” Women reported their primary infertility diagnosis (PCOS, endometriosis, uterine factor infertility, tubal factor infertility, diminished ovarian reserve, male factor infertility, increased maternal age, or other). Responses of other or missing responses were categorized as Other/Unknown.

Statistical analysis

To assess participant differences by participation in the ART-FS, we compared women who enrolled in the ART-FS to those who did not enroll. Specifically, we assessed differences in medical and lifestyle factors reported at enrollment and clinical outcomes from the IVF Study. To evaluate the accuracy of self-reported treatment history, we calculated the validity and reliability of treatment history reported at ART-FS compared to the report on the IVF Study questionnaire. We looked at use of IVF, Clomiphene or Gonadotropin injections, and any fertility treatment, considering the IVF Study self-report as the gold standard. We also calculated validity and reliability of self-reported IVF treatment details from the ART-FS compared to IVF Study medical records. Usage of fresh cycles and of frozen cycles (yes or no) was compared. Similarly, accuracy of the number of IVF cycles (fresh, frozen, and fresh and frozen combined) was evaluated.

To evaluate recall of infertility diagnosis, we compared self-reported primary infertility diagnosis from the ART-FS to self-reported diagnoses from the IVF Study. Women could self-report multiple infertility diagnoses at IVF Study enrollment, but only a primary infertility diagnosis at ART-FS. Therefore, we considered two groups: i) a restricted sample of women who self-reported one infertility diagnosis and ii) a sample of all women who reported any number of diagnoses during the original IVF Study. In the second group, recall was classified as “valid recall” when one of the diagnoses reported during the IVF Study was recalled as the primary infertility diagnosis on ART-FS. We also compared self-reported primary infertility diagnosis from the ART-FS to i) the primary clinical diagnosis only and ii) any clinical diagnosis (primary, secondary or other) abstracted from clinical records. When one of the clinical diagnoses recorded during the original IVF Study was recalled as the primary infertility diagnosis on ART-FS, we classified this as “valid recall”.
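
The two comparison groups and the “valid recall” rule can be illustrated with a short Python sketch. The record layout (a `baseline` set of diagnoses and a single `recalled` primary diagnosis) and the function names are hypothetical and are used here only to make the definitions concrete.

```python
# Hypothetical sketch of the "valid recall" rule and the restricted sample.
# Each record holds the set of diagnoses from the IVF Study ("baseline")
# and the single primary diagnosis recalled on the ART-FS ("recalled").

def is_valid_recall(record):
    """True when the recalled primary diagnosis is among the baseline diagnoses."""
    return record["recalled"] in record["baseline"]


def restricted_sample(records):
    """Keep only women with exactly one baseline diagnosis."""
    return [record for record in records if len(record["baseline"]) == 1]


records = [
    {"baseline": {"Endometriosis"}, "recalled": "Endometriosis"},
    {"baseline": {"Male factor infertility", "Tubal factor infertility"},
     "recalled": "Tubal factor infertility"},
    {"baseline": {"PCOS"}, "recalled": "Other/unknown"},
]

valid_flags = [is_valid_recall(record) for record in records]
single_diagnosis = restricted_sample(records)
```

The same rule applies to the clinical-record comparison by substituting either the primary clinical diagnosis alone or the full set of clinical diagnoses for `baseline`.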

For all analyses, reliability was calculated as either Cohen’s kappa coefficient (K, a measure of inter-rater agreement for binary items) or weighted Cohen’s kappa coefficient (Kw, for inter-rater agreement of items with more than two categories). Kappa coefficients take into consideration the possibility of agreement between raters occurring by chance, so they are thought to be more robust than percent agreement (another measure of inter-rater reliability), though more conservative [25]. The kappa coefficient is widely used in agreement studies of categorical data though it has been noted to be vulnerable to the prevalence of the underlying disease and the tendencies of raters to classify test results a certain way [26]. The kappa coefficient has been used previously in studies examining recall. Some examples include the recall of menstrual irregularity [27], recall of health care resource utilization compared to abstracted medical records [28], and recall of medication use compared to prescription database records [29]. Validity was calculated as sensitivity and specificity. 95% confidence intervals (95% CIs) were calculated for all measures. Statistical analyses were conducted using SAS v9.4 software (Cary, NC).
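
As a concrete illustration of these measures, the following Python sketch computes sensitivity, specificity, Cohen's kappa, and a weighted kappa from cross-tabulated counts. It is not the SAS code used in the study: the function names and the example counts are invented for illustration, and the linear disagreement weights are an assumption, since the weighting scheme is not stated in the text. Confidence intervals are omitted.

```python
# Illustrative sketch only; counts and names are hypothetical.

def sensitivity_specificity(tp, fn, fp, tn):
    """Validity of recall against the reference report (2x2 counts)."""
    return tp / (tp + fn), tn / (tn + fp)


def cohen_kappa(tp, fn, fp, tn):
    """Chance-corrected agreement for a binary item."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    # Expected agreement if the two reports were independent, from the marginals.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    return (observed - expected) / (1 - expected)


def weighted_kappa(table):
    """Weighted kappa for a square k x k table of counts (k >= 2).

    Assumption: linear disagreement weights |i - j| / (k - 1).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    observed = sum(
        abs(i - j) / (k - 1) * table[i][j]
        for i in range(k) for j in range(k)
    ) / n
    expected = sum(
        abs(i - j) / (k - 1) * row_totals[i] * col_totals[j]
        for i in range(k) for j in range(k)
    ) / n ** 2
    return 1 - observed / expected


# Invented example: 100 women, reference report in rows, recalled report in columns.
sens, spec = sensitivity_specificity(tp=40, fn=10, fp=20, tn=30)
kappa = cohen_kappa(tp=40, fn=10, fp=20, tn=30)
kappa_w = weighted_kappa([[40, 10], [20, 30]])
```

For a two-category table the linearly weighted kappa reduces to the unweighted kappa, which gives a quick consistency check on the two functions.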

In sensitivity analyses, we stratified the study population by those who reported receiving additional IVF treatments after the IVF Study and repeated our analyses comparing accuracy of self-reported treatment history at ART-FS to treatment history from the IVF Study questionnaire and clinical records to see if their recall differed from those who did not have additional IVF treatments. We also considered the possibility that women in the IVF Study might have received further infertility diagnosis information during additional clinical treatments, which could affect their recall during the ART-FS. We repeated our main analyses comparing primary infertility diagnosis reported during the ART-FS to self-reported diagnoses from IVF Study enrollment and diagnoses from IVF Study clinical records under two scenarios: (i) excluding women who received additional IVF treatments after their participation in the IVF Study and (ii) excluding women who received more than two IVF cycle treatments during the IVF Study.

Results

Of the 2644 women in the IVF Study, 2244 (85%) were successfully recontacted and 909 consented (41% of those recontacted, Fig. 2). Of these, 808 women (89%) completed the ART-FS questionnaire and were included in the analyses. Women who completed the ART-FS (completers) among those successfully recontacted had on average 19.6 years (standard deviation (SD) 2.7) between treatment initiation and follow-up. Completers were more likely to be non-Hispanic white, to have completed graduate school, and to be never smokers at the time of enrollment in the IVF Study, compared to those who did not complete the ART-FS (non-completers) (Table 1). We saw no meaningful difference between completers and non-completers in age, marital status, use of depression medication, history of pregnancy, or history of miscarriage reported at IVF Study enrollment. Completers were more likely than non-completers to have had at least one successful IVF cycle (resulting in a livebirth or at least a chemical pregnancy with unknown pregnancy outcome) during the IVF Study. According to clinical records, 98.6% of our study sample had at least one fresh IVF cycle and 20.2% had at least one frozen IVF cycle between their enrollment and end of follow-up in the IVF Study (Table 1).

Fig. 2 AfteR Treatment Follow-up Study participants recontacted and recruited from the IVF Study

Table 1 Demographics of women in IVF Study (1994–2003), by response to ART-FS (2018), N = 2644

When we compared self-reported fertility treatment prior to the IVF Study as reported during the IVF Study and during the ART-FS, sensitivity and specificity values were consistent across different fertility treatment modalities (prior use of IVF: sensitivity = 0.85, specificity = 0.63; prior use of Clomiphene or Gonadotropin injections: sensitivity = 0.81, specificity = 0.55; prior use of any fertility treatment: sensitivity = 0.85, specificity = 0.52) (Table 2). We also evaluated recall of specific IVF treatment details (type of transfer, number of cycles), comparing self-reported data from the ART-FS to clinical records. Sensitivity of recall of ever use of fresh IVF cycles was high (0.88, 95% CI 0.86, 0.90) but specificity was low (0.27, 95% CI 0.01, 0.54) (Table 3). For frozen cycles, sensitivity was 0.56 (95% CI 0.49, 0.64) and specificity was 0.71 (95% CI 0.68, 0.75). Kw’s comparing number of self-reported IVF cycles to clinical records were moderate; for all combined cycles (fresh and frozen) Kw was 0.50 (95% CI 0.45, 0.55), for fresh cycles only Kw was 0.50 (95% CI 0.45, 0.55), and for frozen cycles only Kw was 0.40 (95% CI 0.32, 0.49).

Table 2 Fertility treatment usage before IVF Study reported at ART-FS compared to self-report at IVF Study
Table 3 IVF usage during IVF Study reported at ART-FS compared to clinical records

When evaluating validity of self-reported recall of infertility diagnoses, sensitivity values and K’s were higher among women with a single self-reported infertility diagnosis (N = 509) than women with multiple diagnoses (N = 808) (Table 4). Among women with a single self-reported infertility diagnosis, recall of all infertility diagnoses had relatively high sensitivity (> 0.61) and specificity (≥ 0.79) (excluding uterine factor infertility which had a small sample size). Male factor infertility (K = 0.82, 95% CI 0.76, 0.87), endometriosis (K = 0.76, 95% CI 0.65, 0.86) and tubal factor infertility (K = 0.73, 95% CI 0.64, 0.82) had the highest agreement between the two self-reported questionnaires.

Table 4 Self-reported primary infertility diagnosis at ART-FS compared to self-report from IVF Study

In general, the agreement between self-reported primary infertility diagnosis from the ART-FS and clinical records (Table 5) was not as strong as the agreement with self-report at IVF Study enrollment (Table 4). Restriction to the primary clinical diagnosis had higher sensitivity and K’s in comparison to values calculated when considering any diagnosis from the medical records (Table 5). However, the improvements were not large, and values of several diagnoses were unchanged (e.g. PCOS, uterine factor infertility, and diminished ovarian reserve/increased maternal age).

Table 5 Self-reported primary infertility diagnosis at ART-FS compared to clinical record

Recall of the details of IVF cycles during the IVF Study (type of transfer, number of cycles) was generally the same among those who received additional IVF treatments after the IVF Study as among those who did not (Supplemental Table 1). When we repeated our main analyses of infertility diagnoses (Tables 4 and 5) after excluding women who received additional IVF treatments after the IVF Study, the results were generally unchanged (Supplemental Tables 2 and 3). Similarly, when we instead excluded women who had more than two IVF cycles during the IVF Study, the results were generally unchanged compared to the results from our main analyses (Supplemental Tables 4 and 5).

Discussion

Principal findings

We observed that approximately 20 years after fertility treatment, women’s recall of a specific period of their treatment history varied greatly by the level of treatment detail, while recall of their primary infertility diagnosis varied by diagnosis. Recall of self-reported use of fertility treatment had consistently moderate sensitivity but low specificity across different infertility treatment modalities. Recalled details of IVF cycles (number of cycles, fresh or frozen embryo transfers) had low to moderate validity and reliability compared with medical records. Accuracy of primary infertility diagnosis recall was higher when compared to earlier self-report than when compared to medical records. Validity and reliability for primary infertility diagnosis also varied greatly depending on the diagnosis.

Interpretation

Prior research on the validity and reliability of recalled fertility treatment and infertility diagnoses has been sparse, with limited duration of follow-up. A previous study by Thomas et al. of 63 women receiving services from a specialized fertility treatment center in 2004 reported that elements of women’s fertility treatment history could be accurately captured (more than 90% sensitivity for all elements) by a self-reported questionnaire 5–6 years after treatment initiation [14]. Research from the Nurses’ Health Study II also supports this finding, with > 80% concordance of self-reported gonadotropin use when comparing prospective reports to lifetime history with a maximum of 16 years of follow-up [30]. In our study, the agreement between self-report of ever use of IVF and medical records was high (K = 0.74, 95% CI 0.57, 0.90; sensitivity = 0.96, 95% CI 0.88, 1.00; specificity = 0.82, 95% CI 0.69, 0.94). In comparison, we observed low to moderate validity and reliability between self-reported treatment history at follow-up and self-reported treatment history at original study initiation. The lower values that we detected could be due to several factors. In our study, participants were asked to recall details an average of 20 years after treatment initiation (approximately 15 years longer than other studies). It has been shown for other health conditions that self-report is subject to recall bias, particularly with increasing duration between the event and the survey [31]. Our study also examined precise treatment details (treatment during clearly defined time periods, number of cycles, fresh versus frozen embryo cycles). To our knowledge, this is the first study to examine these details of fertility treatment history. However, the complexity of these details may represent a barrier to recall given the health literacy needed to recall them accurately. This level of information may not be appropriate to utilize in studies involving participants from the general public. The questionnaire developed by Thomas et al. prefaced sections on various fertility treatments with introductory sentences defining the treatment modality in clear terms (e.g. “…By ART treatment, we mean any treatment that involves removing the egg from the woman’s body and then replacing the egg or embryo back into the body”) and, to capture pregnancies and attempts to conceive, provided an extensive definition for an “attempt” and multiple examples of responses using their definitions for different scenarios. Therefore, future investigators could consider asking about a woman’s fertility history more broadly and providing definitions or examples for critical items of interest to capture more accurate information, especially over an extended period of recall.

In our study, we observed that accuracy of infertility diagnosis at follow-up was higher when compared to self-report at treatment initiation than when compared to medical records. To our knowledge, this is the first study to report comparisons to both self-report at prior study enrollment and medical records. The higher validity and reliability across self-reports could suggest that women interpret or attribute the cause of their infertility differently from clinicians. This may have implications for clinical practice, and clinicians may consider ensuring that diagnoses and results are more clearly communicated to patients.

Our analyses of primary infertility diagnosis also revealed great variability in validity and reliability depending on the specific diagnosis. It is possible that participants could have reported a secondary instead of their primary diagnosis during the ART-FS due to recall issues; however, in analyses where we considered women with one or more infertility diagnoses during the IVF Study (Tables 4 and 5), recall was not improved. It is also plausible that women who have unsuccessful fertility treatment attempts may receive additional infertility diagnoses as their treatment progresses. However, in sensitivity analyses where we (i) excluded women who reported receiving additional IVF treatments after the IVF Study or (ii) excluded women who received more than two IVF cycles during the IVF Study, recall was not improved compared to our main results (Tables 4 and 5). Highest values comparing self-report to clinical records in our study were seen for primary diagnoses of male factor infertility (K = 0.66, 95% CI 0.61, 0.72; sensitivity = 0.67, 95% CI 0.61, 0.72; specificity = 0.96, 95% CI 0.94, 0.97) and tubal factor infertility (K = 0.62, 95% CI 0.54, 0.70; sensitivity = 0.54, 95% CI 0.46, 0.63; specificity = 0.98, 95% CI 0.98, 0.99). A study by de Boer et al. [20], comparing self-reported diagnoses to medical records in the Netherlands, also reported that the highest validity and reliability values were seen for a diagnosis of either male factor (K = 0.71; sensitivity = 0.78; specificity = 0.91) or tubal factor infertility (K = 0.79; sensitivity = 0.84; specificity = 0.94). Male factor and tubal factor infertility may have a more clearly defined etiology and therefore have higher accuracy of recall, compared to less prevalent and complex factors such as hormone-related infertility. De Boer et al. observed that fewer than 18% of participants had 8 or more years of follow-up [20], while in our study the average time between recall and treatment initiation was almost 20 years. The greater period of follow-up combined with the differences in measurement of infertility in our study’s medical records compared to the ART-FS questionnaire may have contributed to the overall lower values of validity and reliability compared to de Boer et al. [20]. This suggests that investigators who are planning a study involving infertility diagnosis recalled over an extended time should consider providing more details about or specific examples of the infertility categories they are interested in capturing.

Strengths and limitations

The ART-FS was formed from a previous cohort of women who sought IVF services approximately 20 years ago, which, to our knowledge, is the longest period of follow-up with detailed self-report and medical record data available in the current literature [14, 20]. Our study accessed extensive clinical records from a prior IVF study, allowing us to consider the accuracy of recalled details of fertility treatment (fertility treatments during a specific timeframe, number of cycles, fresh versus frozen embryo transfers) that had not been considered by previous studies. Additionally, we were able to evaluate the accuracy of self-reported infertility and treatment at follow-up compared to self-report at treatment initiation, which to our knowledge has not yet been reported.

Despite these strengths, there are several important limitations to our study that should be considered. There is potential misclassification of infertility diagnosis due to the different terminology used across the medical records and two separate questionnaires. As noted previously, this may affect less prevalent diagnoses and/or diagnoses with more complex etiology or diagnostic criteria (e.g. uterine factor infertility, diminished ovarian reserve/increased maternal age) more so than other more specific diagnoses (e.g. tubal factor or male factor infertility). During the ART-FS, we only asked participants to report their primary infertility diagnosis, while at treatment initiation and in medical records, multiple diagnoses could be recorded. As a result, while we were able to successfully consider women with a singular diagnosis, we were not able to effectively evaluate women with multiple diagnoses. Indeed, when we restricted our sample sizes to women who either only self-reported one diagnosis (at treatment initiation) or only had a primary infertility diagnosis (in medical record), validity and reliability values increased. In addition, changes in infertility diagnoses or clinical diagnosis procedures compared to when our cohort began fertility treatments (1994–2003) may reduce generalizability compared to current treatment standards.

It should also be noted that the women who participated in our analysis differed with regard to certain characteristics from the women whom we were either not able to recontact or who chose not to participate in the ART-FS. Women in the ART-FS were more likely to be non-Hispanic white and to have at least a college degree. These women were also more likely to have had a successful IVF cycle during the IVF Study (55%) compared to women who chose not to participate (31%) and women whom we were not able to recontact (46%). These differences may affect our ability to generalize our results to other groups of women utilizing infertility treatments. It is possible that women who were less fixated on the outcome of their IVF cycles during the IVF Study were less likely to accurately recall the details of their treatment. For example, women who experienced a successful IVF cycle could have been more satisfied with their treatment and less likely to recall the details of their treatment in the same way as women who did not have a successful IVF cycle and therefore may have been less satisfied with their treatment. The few studies that have investigated the potential association between patient perception or experience during clinical interactions and recall ability have produced mixed results [32, 33], and recent evidence is lacking.

Conclusions

To use women’s self-reported fertility data for research purposes, we must have confidence that this information is recalled and reported accurately. Our study examining women’s recall of their infertility and treatment history almost 20 years after their fertility treatment initiation shows that women previously treated for infertility are moderately accurate in their recall of very specific treatment details. Reliability of self-reported infertility diagnosis varied by diagnosis and method of measurement. Researchers should consider these issues when designing studies and utilizing self-reported history of infertility to improve the accuracy of measurement collection.