Background

Controlled ovarian stimulation (COS) is key to successful assisted reproductive technology (ART). Individualization of COS in in vitro fertilization (IVF) treatments should be based on assessing ovarian reserve and predicting ovarian response for every patient [1]. The starting point is to identify whether a patient is likely to have a normal, poor, or high response, and to choose the treatment protocol best tailored to this prediction [1]. Patients’ characteristics and biomarkers can predict ovarian response accurately [2]. However, although numerous biochemical measures have been developed to predict IVF outcomes, some of them, such as estradiol (E2), luteinizing hormone (LH), basal follicle-stimulating hormone (FSH), and inhibin concentrations, fluctuate greatly with the day of the menstrual cycle and do not change significantly as ovarian reserve declines; they are therefore of limited use owing to their low predictive value [3, 4]. Studies have shown that antral follicle count (AFC) predicts ovarian response better than these endocrine markers [5, 6].

Anti-Müllerian hormone (AMH), a dimeric glycoprotein, is a member of the extended transforming growth factor-β (TGF-β) family [7, 8]. AMH production diminishes as the follicles become FSH-dependent [8, 9]. Serum AMH levels do not vary during the menstrual cycle, are most probably not affected by exogenous steroid administration, and are closely correlated with reproductive age [10]. AMH has therefore been used to predict poor and high response in IVF. Several studies have argued that the AMH level is a better predictor of ovarian response than AFC [11]. However, the data remain conflicting and inconsistent [10], and some studies continue to advocate both AFC and AMH as possible predictors of ovarian response [12]. Broer and colleagues [13, 14] performed meta-analyses in 2009 and 2011 and demonstrated that AMH has at least the same level of accuracy and clinical value as AFC for the prediction of poor or excessive response, but the number of studies included in their meta-analyses was small (N = 5–12). Our study therefore aimed to conduct a meta-analysis including more eligible studies, to assess the diagnostic value of AMH and AFC for predicting poor or high response in IVF treatment.

Methods

The present meta-analysis was performed according to the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) statement [15].

Search strategy and data sources

The data sources were the electronic databases PubMed, EMBASE, and the Cochrane Library (searched up to 1 May 2022). The following keywords were used: in vitro fertilization (IVF), fertilization in vitro, assisted, or intracytoplasmic, in combination with Anti-Mullerian Hormone (AMH), Mullerian-Inhibiting Factor, Mullerian-Inhibitory Substance, Mullerian Inhibiting Hormone, or Antral Follicle Count (AFC). There was no language limitation, and we also retrieved articles by manual screening. The complete search strategy is provided in the Supplementary material.

Inclusion and exclusion criteria

The inclusion criteria were based on the Population, Intervention, Comparator, Outcomes, and Study designs (PICOS) structure: P) adult infertile women; I) patients receiving COS for IVF or intracytoplasmic sperm injection (ICSI); C) AMH or AFC to predict ovarian reserve; O) ovarian response, including poor or high response; S) prospective design. In addition, a study was included in the final analysis if 2 × 2 tables could be constructed from the data presented in the paper. Reviews, conference abstracts, case reports, letters, and animal trials were excluded.

Data extraction

Information was extracted from eligible studies by two authors independently. The following information was included: the authors of the articles, publication year, study location, definition of poor or high response, sample size, true positives (TP), false positives (FP), false negatives (FN), true negatives (TN), and cut-off value. Disagreements were resolved by discussion among all authors.
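As an illustration of how the extracted 2 × 2 counts translate into per-study accuracy measures, the following minimal Python sketch computes sensitivity and specificity from TP, FP, FN, and TN. The class name `TwoByTwo` and the counts are hypothetical and are not taken from any included study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """2 x 2 table for one study (hypothetical helper, not from the paper)."""
    tp: int  # poor/high responders correctly flagged by the test
    fp: int  # normal responders incorrectly flagged
    fn: int  # poor/high responders missed by the test
    tn: int  # normal responders correctly not flagged

    @property
    def sample_size(self) -> int:
        return self.tp + self.fp + self.fn + self.tn

    @property
    def sensitivity(self) -> float:
        # Proportion of true poor/high responders detected by the test
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        # Proportion of normal responders correctly classified
        return self.tn / (self.tn + self.fp)


# Hypothetical counts, for illustration only
example = TwoByTwo(tp=40, fp=15, fn=10, tn=135)
print(example.sample_size, example.sensitivity, example.specificity)
```

Studies for which such a table cannot be reconstructed provide no input to the pooled estimates, which is why the availability of 2 × 2 data was an inclusion requirement.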

Study quality assessment

We used the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool [16], the most widely recommended quality assessment tool for diagnostic accuracy studies, to assess the quality of the included articles. It consists of four domains: patient selection, index test, reference standard, and flow and timing. All domains were assessed for risk of bias, and the first three were also assessed for clinical applicability. The risk of bias is judged with signalling questions, whereas there are no signalling questions for clinical applicability. The “yes”, “no” or “uncertain” answers to the signalling questions in each domain correspond to a risk-of-bias rating of “low”, “high” or “uncertain”. If the answer to all signalling questions in a domain is “yes”, the risk of bias is rated “low”; if the answer to any one of the questions is “no”, the risk of bias is rated “high”. The “uncertain” rating is used only when the article does not report enough detail for the evaluator to make a judgment.
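The rating rule described above can be written down compactly. The sketch below is only an illustration of that rule; the function name `risk_of_bias` and the example answers are hypothetical, and real QUADAS-2 judgments also involve reviewer discretion that a simple rule cannot capture.

```python
def risk_of_bias(answers):
    """Map the answers for one domain to a risk-of-bias rating.

    `answers` is a list of strings, each "yes", "no" or "uncertain".
    Rule as described in the text: any "no" gives "high", all "yes"
    gives "low", and anything else is rated "uncertain".
    """
    normalized = [a.strip().lower() for a in answers]
    if not normalized:
        raise ValueError("at least one answer is required")
    if any(a == "no" for a in normalized):
        return "high"
    if all(a == "yes" for a in normalized):
        return "low"
    return "uncertain"


# Hypothetical answers for the patient selection domain of one study
print(risk_of_bias(["yes", "yes", "yes"]))
print(risk_of_bias(["yes", "no", "uncertain"]))
print(risk_of_bias(["yes", "uncertain", "yes"]))
```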

Statistical analysis

All statistical analyses were conducted in Stata V.14.0 (StataCorp LP). The Cochran Q and I² statistics were used to test the heterogeneity among studies; I² > 50% was taken to indicate heterogeneity. The bivariate regression model was used to calculate the pooled sensitivity, specificity, and area under the receiver operating characteristic (ROC) curve (AUC), together with their 95% confidence intervals (CIs). Overall performance was assessed by estimating and comparing pooled ROC curves for AMH and AFC. Furthermore, meta-regression was used to explore the causes of heterogeneity between studies. Subgroup analyses were performed based on the cut-off value and sample size. Deeks’ funnel plot was used to test for publication bias. A two-tailed probability value below 0.05 was regarded as statistically significant.
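To make the heterogeneity statistics concrete, the following Python sketch computes the Cochran Q and the I² statistic for logit-transformed sensitivities using fixed-effect inverse-variance weights. It is a simplified univariate illustration and not the bivariate model fitted in Stata; the function names, the 0.5 continuity correction, and the study counts are assumptions made for the example.

```python
import math


def logit_with_variance(events, total):
    """Logit of a proportion and its approximate variance.

    A 0.5 continuity correction is added to both cells (an assumption
    of this sketch) so that proportions of 0 or 1 remain finite.
    """
    successes = events + 0.5
    failures = total - events + 0.5
    return math.log(successes / failures), 1 / successes + 1 / failures


def cochran_q_and_i2(estimates, variances):
    """Return (Q, I2 in percent) for a set of study-level estimates."""
    weights = [1 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    # I2 is the share of total variation attributable to heterogeneity
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2


# Hypothetical (TP, TP + FN) pairs for three studies
hypothetical = [(45, 50), (20, 50), (30, 50)]
logits, variances = zip(*(logit_with_variance(tp, n) for tp, n in hypothetical))
q_stat, i2_stat = cochran_q_and_i2(logits, variances)
print(f"Q = {q_stat:.2f}, I2 = {i2_stat:.1f}%")
```

The same calculation applies to specificity by replacing the pairs with (TN, TN + FP), which is why heterogeneity is reported separately for sensitivity and specificity.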

Results

Study selection and study characteristics

In total, 7327 articles were identified in electronic and manual searches. Of these, 1847 were excluded as duplicates, and another 2698 were excluded because of study type (reviews, meeting abstracts, letters, animal trials, and case reports). A further 2680 records were excluded after review of the title and abstract, leaving 102 articles for full-text review, of which 60 were excluded. Finally, 42 articles [10, 11, 17–56] were included in this meta-analysis (Fig. 1).
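The selection counts are internally consistent, as the short Python check below shows. The variable names are illustrative; the numbers are those reported in the paragraph above.

```python
identified = 7327
excluded_before_full_text = {
    "duplicates": 1847,
    "ineligible study type": 2698,
    "title and abstract screening": 2680,
}
excluded_at_full_text = 60

# Records remaining for full-text review, then studies finally included
full_text_reviewed = identified - sum(excluded_before_full_text.values())
included = full_text_reviewed - excluded_at_full_text
print(full_text_reviewed, included)
```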

Fig. 1 Flow diagram of the study selection process

The characteristics of the eligible studies are listed in Tables 1 and 2. The sample size of each study ranged from 44 to 571 participants, and the meta-analysis included 7190 individuals in total. All 42 studies had a prospective design and were published between 2002 and 2021. The included studies came from different countries, including China (n = 3), Spain (n = 4), the UK (n = 7), and the USA (n = 4), among others. For poor response, AMH was evaluated in 29 studies and AFC in 15 studies. For high response, AMH was evaluated in 13 studies and AFC in 6 studies.

Table 1 Characteristics of the included studies of AMH for predicting ovarian response
Table 2 Characteristics of the included studies of AFC for predicting ovarian response

Study quality

We used QUADAS-2 to assess the quality of the included studies (Supplementary material). Regarding risk of bias, 5 studies included consecutive patients, and 37 studies were at low risk for the index test. Regarding applicability concerns, all studies were at low risk for both patient selection and index test.

Accuracy of AMH and AFC for predicting poor response

The pooled predictive ability of AMH and AFC for poor response in IVF/ICSI treatments is presented in Table 3. The overall pooled sensitivity and specificity of AMH were 0.80 (95%CI: 0.74–0.85) and 0.81 (95%CI: 0.75–0.85), respectively. There was significant heterogeneity in both sensitivity and specificity (I² = 68.26% and 92.43%, respectively). The overall ROC curve is presented in Fig. 2A, and the AUC was 0.87 (95%CI: 0.84–0.90). The overall pooled sensitivity and specificity of AFC were 0.73 (95%CI: 0.62–0.83) and 0.85 (95%CI: 0.78–0.90), respectively. Heterogeneity was found in both sensitivity and specificity (I² = 85.28% and 91.76%, respectively). The overall ROC curve is presented in Fig. 2B, and the AUC was 0.87 (95%CI: 0.84–0.90).

Table 3 Results of the subgroup analysis
Fig. 2 The summary receiver operating characteristic (SROC) curves of AMH and AFC for the prediction of ovarian response. A AMH, poor response; B AFC, poor response; C AMH, high response; D AFC, high response

Accuracy of AMH and AFC for predicting high response

Table 3 presents the pooled predictive ability of AMH and AFC for high response in IVF/ICSI treatments. The overall pooled sensitivity and specificity of AMH were 0.81 (95%CI: 0.76–0.86) and 0.84 (95%CI: 0.77–0.90), respectively. Heterogeneity was found in both sensitivity and specificity (I² = 83.00% and 95.90%, respectively). The overall ROC curve is presented in Fig. 2C, and the AUC was 0.89 (95%CI: 0.86–0.91). The overall pooled sensitivity and specificity of AFC were 0.85 (95%CI: 0.77–0.91) and 0.83 (95%CI: 0.64–0.94), respectively. There was significant heterogeneity in both sensitivity and specificity (I² = 74.53% and 96.70%, respectively). The overall ROC curve is presented in Fig. 2D, and the AUC was 0.90 (95%CI: 0.87–0.92).

Subgroup analysis

Comparison of the summary estimates for the prediction of poor or high response showed significant differences in performance between AMH and AFC [poor (sensitivity: 0.80 vs 0.74, P < 0.050; specificity: 0.81 vs 0.85, P < 0.001); high (sensitivity: 0.81 vs 0.87, P < 0.001)]. There were no significant differences between the AUCs of AMH and AFC for predicting high (P = 0.835) or poor response (P = 0.567). Moreover, under the same definition of poor response (< 4 oocytes), AMH and AFC differed significantly in sensitivity (0.78 vs 0.81, P < 0.001) and specificity (0.77 vs 0.80, P < 0.001) (Table 3). However, no significant difference was found between the AUCs of AMH and AFC (P = 0.800).

Meta-regression analysis

For AMH, the cut-off value was a significant source of heterogeneity (poor: P = 0.020). For AFC, the cut-off value was also a significant source of heterogeneity (poor: P < 0.010; high: P < 0.050). However, sample size was not a significant source of heterogeneity (P > 0.05).

Publication bias

Deeks’ funnel plots indicated no publication bias for AMH in predicting poor response (P = 0.510, Fig. 3A) or high response (P = 0.348, Fig. 3C), or for AFC in predicting poor response (P = 0.396, Fig. 3B) or high response (P = 0.818, Fig. 3D).
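For readers unfamiliar with the funnel plot asymmetry test used here, the sketch below outlines its usual form: the log diagnostic odds ratio of each study is regressed on the inverse square root of the effective sample size, weighted by the effective sample size, and the slope is tested against zero. This is a simplified illustration under stated assumptions, not the Stata routine used in the analysis; the function names, the 0.5 continuity correction, and the 2 × 2 counts are hypothetical. The P value would follow from comparing the t statistic with a t distribution on n − 2 degrees of freedom, which is omitted to keep the example dependency-free.

```python
import math


def effective_sample_size(n_positive, n_negative):
    """ESS = 4 * n1 * n2 / (n1 + n2) for the two outcome groups."""
    return 4 * n_positive * n_negative / (n_positive + n_negative)


def log_dor(tp, fp, fn, tn):
    """Log diagnostic odds ratio with a 0.5 continuity correction."""
    return math.log(((tp + 0.5) * (tn + 0.5)) / ((fp + 0.5) * (fn + 0.5)))


def funnel_asymmetry_slope(studies):
    """Weighted regression of lnDOR on 1/sqrt(ESS); returns (slope, se, t)."""
    if len(studies) < 3:
        raise ValueError("at least three studies are needed")
    x, y, w = [], [], []
    for tp, fp, fn, tn in studies:
        ess = effective_sample_size(tp + fn, fp + tn)
        x.append(1 / math.sqrt(ess))
        y.append(log_dor(tp, fp, fn, tn))
        w.append(ess)
    total_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Weighted residual sum of squares on n - 2 degrees of freedom
    rss = sum(wi * (yi - intercept - slope * xi) ** 2
              for wi, xi, yi in zip(w, x, y))
    se = math.sqrt(rss / (len(studies) - 2) / sxx)
    return slope, se, slope / se


# Hypothetical (TP, FP, FN, TN) counts, for illustration only
hypothetical_studies = [(40, 15, 10, 135), (20, 30, 12, 90),
                        (60, 25, 20, 300), (15, 8, 9, 60)]
slope, se, t_stat = funnel_asymmetry_slope(hypothetical_studies)
print(f"slope = {slope:.3f}, se = {se:.3f}, t = {t_stat:.3f}")
```

A slope that does not differ significantly from zero, as reported for all four analyses, is read as no evidence of small-study effects.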

Fig. 3 Deeks’ funnel plots for publication bias. A AMH, poor response; B AFC, poor response; C AMH, high response; D AFC, high response

Discussion

Main findings

The present meta-analysis summarizes the available evidence on the accuracy of AMH and AFC for predicting poor or high response to ovarian stimulation in IVF treatments. Although the differences were statistically significant, the sensitivities and specificities of AMH and AFC were similar in magnitude. Both AMH and AFC appear to have a good discriminatory capacity to predict poor or high response in IVF. In addition, the ROC curves did not indicate a better predictive ability for AMH than for AFC, and the difference was not statistically significant. Our results are consistent with previous studies [13, 14, 48, 58]. For example, Broer et al. [13, 14] concluded in their meta-analyses that both AMH and AFC are accurate predictors of poor or high response to ovarian hyperstimulation, and that both tests appear to have clinical value.

Prior research indicated that AFC is better than AMH at predicting poor ovarian response [10]. However, several studies argued that the AMH level is a better predictor of ovarian response than AFC [11, 43]. In our study, comparison of the summary estimates for the prediction of poor or high response showed a significant difference in performance between AMH and AFC, while there was no significant difference in the ROC curves. The discrepancies between studies could be associated with the heterogeneity of the definitions of ovarian response to ovarian stimulation. We therefore conducted a subgroup analysis based on the definition of poor response, and found that AFC performed relatively better than AMH in both sensitivity (0.81 vs 0.78, P < 0.001) and specificity (0.80 vs 0.77, P < 0.001) when poor response was defined as < 4 oocytes. Although no significant differences were found in the ROC curves, AFC seemed to perform slightly better than AMH for predicting poor response (AUC: 0.87 vs 0.84). Broer et al. [13] reported similar findings for AFC and AMH in the prediction of high response.

Our study found that the accuracy of AMH and AFC for the prediction of poor or high response was reported with many different cut-off values, which complicates clinical practice. The present study therefore performed a subgroup analysis based on the range of cut-off values. AFC for predicting high response achieved its highest AUC when the cut-off value was ≥ 15. The corresponding AUC was 0.90 (95%CI: 0.88–0.93), with a sensitivity of 0.89 and a specificity of 0.94, which indicates that the predictive ability in this range is higher than with cut-off values < 15.

Patient characteristics, including age, menstrual cycle length, and body mass index, can predict abnormal ovarian response. However, these factors have limited predictive value. Recent studies have therefore used multivariate models to predict ovarian response and found that such models could improve predictive power [17, 59,60,61]. For example, Honnma et al. [60] reported that serum AMH in combination with age is a better indicator than AMH alone. Clinicians should therefore consider patients’ characteristics and biomarkers together to predict ovarian response accurately in IVF treatments.

Clinical implications

An abnormal response may increase patient discomfort and even decrease the chance of pregnancy. According to the Italian national assisted reproduction technique (ART) register for 2010, of 52,676 IVF cycles, 6.7% were canceled because of poor ovarian response and 1.5% because of ovarian hyperstimulation syndrome (OHSS) [1]. In other words, more than 4300 cycles were canceled in that year because of an abnormal response to stimulation with gonadotrophins. Furthermore, approximately 35% of couples abandon IVF treatment because of the physical and psychological burden, and 10% because of inadequate ovarian response in the first cycle [62]. It is therefore important to reduce the dropout rate in IVF treatments by reducing abnormal responses. Our study found that both AMH and AFC had a good discriminatory capacity to predict poor or high response in IVF. In addition, an increasing number of studies report that the AMH level is becoming a preferred method for the prediction of ovarian reserve in most women [7, 63]. A multivariable approach combining patient characteristics and AMH should also be taken into account in the evaluation of ovarian response.
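The figure of more than 4300 canceled cycles follows directly from the register percentages, as this short Python check illustrates (the variable names are illustrative; the inputs are the values quoted above).

```python
total_cycles = 52_676
rate_poor_response = 0.067  # canceled for poor ovarian response
rate_ohss = 0.015           # canceled for OHSS

# Expected number of cycles canceled for an abnormal response
canceled = total_cycles * (rate_poor_response + rate_ohss)
print(round(canceled))
```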

Limitations

Several limitations of this meta-analysis should be noted. First, relatively high heterogeneity remained. Although we found that the cut-off value was a significant source of heterogeneity, heterogeneity may also have been caused by other factors, such as study quality and differences in study populations among the included studies. In addition, we found that the quality of the included studies was poor, so more high-quality studies are needed to confirm our conclusions. Second, language bias may exist because only English articles were included in the meta-analysis. Third, the predictive value of AMH and AFC for ovarian response was not always assessed in a head-to-head comparison within the same study, so the accuracy of the results may be affected to some extent by differences in cut-off value and sample size. We tried to address this issue through meta-regression and subgroup analysis.

Conclusions

In summary, the present meta-analysis demonstrated that both AMH and AFC have a good ability to predict poor or high response in IVF treatment.