Introduction

Evaluation of the embryonic potential for implantation and successful pregnancy has always been an important step during in vitro fertilization (IVF) treatments. Previously, these evaluations were based on manual assessments of static embryo morphology observed under a microscope [1] or morphokinetics observed using time-lapse equipment [2]. However, these evaluations are both time-consuming and prone to inter- and intra-observer variability [3]. The introduction of deep learning-based AI for embryo selection has enabled automated evaluations based solely on images [4,5,6] or video sequences [7,8,9].

These AI models have the potential to improve, automate, and standardize the ranking and selection of embryos. To ensure that these promises can be achieved, it is important to apply transparent and standardized approaches for evaluating how well a model achieves its goal of predicting the implantation likelihood of an individual embryo. This ability is often referred to as model performance [10].

In practice, such AI models are used to rank embryos within a treatment cycle in order to decide the transfer order, where the embryos most likely to result in a positive outcome are chosen first. However, since we only know the clinical outcome of one or a few embryos within each cycle, model performance is often evaluated on a population of transferred embryos.

The performance of predictive models can be divided into ranking and prediction performance, also known as model discrimination and model calibration [11, 12]. Model calibration is often assessed by comparing the predicted and achieved rates in subgroups with similar predictions. This is primarily a visual method, and in general, it is recommended to evaluate model calibration using plots rather than combined statistics [13]. Model discrimination is most commonly evaluated through the use of receiver operating characteristic curves (ROC curves) and the area under this curve (AUC). This measure, which is equivalent to Harrell’s concordance (\(c\)) index [14], evaluates the probability of the model correctly identifying the positive embryo in a randomly chosen pair of transferred embryos with one positive and one negative outcome.
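To make the concordance interpretation of the AUC concrete, the following minimal Python sketch (with toy scores, not data from the study) computes the AUC directly as the proportion of positive–negative pairs in which the positive embryo receives the higher score, counting ties as one half:

```python
def concordance_auc(scores_pos, scores_neg):
    """AUC as the probability that a randomly chosen positive-outcome
    embryo scores higher than a randomly chosen negative-outcome embryo,
    with tied scores counted as 1/2 (Harrell's c index for binary outcomes)."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Toy model scores for FH+ and FH- embryos (hypothetical values)
pos = [8.1, 6.4, 9.0]
neg = [5.2, 6.4, 7.3]
print(concordance_auc(pos, neg))  # 7.5 of 9 pairs concordant ≈ 0.833
```

This pairwise formulation is equivalent to the area under the empirical ROC curve.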

Studies on model discrimination with regard to predicting ongoing pregnancy based on embryo image data have reported a wide range of different AUC values for the same model [12]. However, the actual estimated performance might depend on factors like the test data (internal/external), laboratory practice, patient demographics, selection criteria, culture conditions, ploidy testing, donor source, and others [12]. For example, maternal age is a well-known covariate that affects not only the likelihood of a positive fetal heart beat but also the model discrimination with AUC values reported to vary between 0.66 and 0.76 for different age subgroups within a single clinic [15]. Thus, differences in covariate distributions between two populations can bias the comparison of the model’s performance, which may lead to incorrect conclusions.

Biases from factors not directly related to the image data can potentially be mitigated by explicitly accounting for those factors in the AI model. For example, several models use age in combination with static images or time-lapse sequences as input [4, 6, 16,17,18]. The inclusion of age will improve the prediction of the actual probability of a positive outcome (i.e., the model calibration) as well as the model discrimination on the overall population of transferred embryos. However, no study has shown that including age as input also improves the model’s ability to select the most viable embryo within a treatment cycle, where the age is constant. A simple and explicit approach for addressing the effect of a covariate on a measure of performance is to restrict the analysis to subgroups with similar values of the relevant covariate, for example, age. However, this might result in uncertain performance estimates due to the smaller sample size in the subgroups. Another approach is to summarize the performance overall using a covariate-adjusted ROC curve [19]. This approach generates a weighted average of covariate-specific ROC curves, whose AUC can then be calculated as a measure of covariate-adjusted model discrimination performance.

Within the methodological field of epidemiology, comparisons of measures of, e.g., incidence rates across heterogeneous subgroups are often done by using direct standardization where covariate-specific estimates within each subgroup are weighted according to the relative occurrence of the covariate value in similar subgroups of a common reference population [20, 21].
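As a simple illustration of direct standardization (with hypothetical age strata and rates, not data from the study), a set of stratum-specific rates is reweighted by the reference population's stratum proportions:

```python
# Direct standardization: weight stratum-specific rates by the reference
# population's stratum proportions (all numbers below are hypothetical).
ref_weights = {"<35": 0.5, "35-39": 0.3, "40+": 0.2}      # reference age mix
clinic_rates = {"<35": 0.45, "35-39": 0.30, "40+": 0.15}  # clinic-specific rates

standardized = sum(ref_weights[g] * clinic_rates[g] for g in ref_weights)
print(round(standardized, 3))  # 0.5*0.45 + 0.3*0.30 + 0.2*0.15 = 0.345
```

The standardized rate answers the question: what rate would this clinic have seen if its age mix had matched the reference population's?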

In this study, we will use a standardization procedure to evaluate the performance of the iDAScore v1.0 model [7, 22], which is a deep-learning model for embryo selection based on a 3D convolutional network that aims to predict fetal heart beat. The model is only trained on time-lapse sequences as input and thus does not use any patient- or clinic-related inputs. Several studies have shown that the iDAScore model correlates with morphology, morphokinetics, euploidy, fetal heart beat, and live birth outcomes [15, 23,24,25,26]. In the evaluation of model performance, we will consider model performance at the population level.

The aim of this study is to perform a comparison of the clinic-specific performance of an embryo selection AI model, accounting for the differences in age distributions between clinics.

Materials and methods

Study design and participants

The main study population, which we used to evaluate clinic-specific performance, comprises data from four fertility clinics collected between 2013 and 2022. We only included transferred embryos incubated for 5 to 6 days with known maternal age and fetal heart beat outcome. Furthermore, we restricted the analysis to single transfers of embryos from autologous oocytes with maternal age between 21 and 44 years. The data set was not part of the training data for the iDAScore v1.0 model (Vitrolife A/S, Aarhus, Denmark), and hence, we refer to it as the external population. All data used in this study were retrospective and provided in an anonymized format. In Denmark, the study described was deemed exempt from notification to the National Committee on Health Research Ethics according to the Act on Research Ethics Review of Health Research Projects (consolidation act no. 1338 of September 1, 2020).

The flow chart in Fig. 1 shows an overview of the exclusions in the external population.

Fig. 1
figure 1

Flow chart of the external population containing data from four fertility clinics

We defined a reference age population to be used in the standardization based on the independent internal test data set that was held out during the training of iDAScore v1.0 [7]. We applied the same inclusion criteria for this population as defined for the external population.

The clinical endpoint of the study was defined as the presence of a fetal heart beat at 6–8 weeks after embryo transfer, and we distinguish between fetal heart beat positive (FH +) and fetal heart beat negative (FH-) outcomes.

Statistical methods

We first visualized the age distributions and age-specific performance within each clinic. We estimated age distributions using a Gaussian smoothing kernel. The following describes the statistical methods used for assessing and comparing the clinic-specific discriminatory performance of the AI model accounting for the differences in age distribution between the clinics. We will use the term age-standardized to indicate that a performance measure accounts for age.
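The age-density estimation can be sketched as follows (an illustrative pure-Python version with toy ages and an arbitrary fixed bandwidth; the bandwidth selection used in the actual analysis is not specified here):

```python
import math

def gaussian_kde(ages, bandwidth):
    """Kernel density estimate of a maternal age sample using a
    Gaussian smoothing kernel with a fixed bandwidth."""
    n = len(ages)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum of Gaussian bumps centered at each observed age
        return norm * sum(math.exp(-0.5 * ((x - a) / bandwidth) ** 2) for a in ages)

    return density

ages_clinic = [28, 31, 33, 35, 38, 41]  # toy maternal ages for one clinic
f = gaussian_kde(ages_clinic, bandwidth=2.0)
print(f(34))  # estimated age density evaluated at age 34
```

The resulting density function integrates to one and can be evaluated at any age, which is what the weighting scheme below requires.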

Neither the ROC curve nor the AUC measure can be standardized directly in order to account for differences in age distribution when comparing results between clinics, so the methods we applied are based on a ROC curve calculated on weighted data [27] in each clinic. We used the area under these curves to evaluate clinic-specific model performance. Since a ROC curve can be constructed from the sensitivity and specificity, the weights were applied in the calculation of these measures. This gives rise to a weighted ROC (WROC) curve, the area under which can be calculated and interpreted as a weighted, or standardized, AUC.

More specifically, since the sensitivity is based only on embryos with a positive outcome, for an FH + embryo, the weight is defined as the ratio between the age density in the positive subgroups of the reference and the clinic-specific populations evaluated at the maternal age. Similarly, the weight of an FH- embryo is the ratio between the age densities in the negative subgroups. This facilitates calculating weighted sensitivity and specificity values and thereby calculating the age-standardized AUC for each clinic. Please see Appendix 1 for more details on the weighting. We calculated clinic-specific AUCs based on both conventional ROC and WROC (AUCROC and AUCWROC) with 95% confidence intervals (CIs) estimated by bootstrap percentiles from 10,000 repetitions. We assessed the difference in the amount of variation between clinics by calculating the heterogeneity statistic, \({I}^{2}\), of both \({\mathrm{AUC}}_{\mathrm{ROC}}\) and \({\mathrm{AUC}}_{\mathrm{WROC}}\) estimates, separately, as well as the between-clinic variances, \({\tau }_{\mathrm{ROC}}^{2}\) and \({\tau }_{\mathrm{WROC}}^{2}\), as defined in a random effects meta-analysis [28, 29] using the DerSimonian and Laird method [30].
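The analyses themselves were carried out in R with the WeightedROC package (see Software), but the weighting scheme is simple enough to sketch. The following illustrative Python version (toy data, kernel density estimates with an arbitrary fixed bandwidth) computes the age-standardized AUC as a weighted concordance, where each embryo's weight is the ratio of the reference to the clinic age density within its outcome group; this quantity equals the area under the weighted ROC curve:

```python
import math

def density(ages, bandwidth=2.0):
    """Gaussian kernel density estimate of an age sample."""
    n = len(ages)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return lambda x: norm * sum(math.exp(-0.5 * ((x - a) / bandwidth) ** 2) for a in ages)

def weighted_auc(scores, ages, labels, ref_ages_pos, ref_ages_neg):
    """Age-standardized AUC: each FH+ (FH-) embryo is weighted by the ratio
    of the reference to the clinic age density among positives (negatives);
    the weighted concordance equals the area under the WROC curve."""
    clin_pos = density([a for a, y in zip(ages, labels) if y == 1])
    clin_neg = density([a for a, y in zip(ages, labels) if y == 0])
    ref_pos = density(ref_ages_pos)
    ref_neg = density(ref_ages_neg)
    pos = [(s, ref_pos(a) / clin_pos(a)) for s, a, y in zip(scores, ages, labels) if y == 1]
    neg = [(s, ref_neg(a) / clin_neg(a)) for s, a, y in zip(scores, ages, labels) if y == 0]
    num = sum(wp * wn * (1.0 if sp > sn else 0.5 if sp == sn else 0.0)
              for sp, wp in pos for sn, wn in neg)
    return num / (sum(w for _, w in pos) * sum(w for _, w in neg))

# Toy clinic data: model scores, maternal ages, outcomes (1 = FH+, 0 = FH-)
scores = [9.0, 7.0, 8.0, 5.0, 8.0, 4.0]
ages = [30, 34, 38, 29, 35, 41]
labels = [1, 1, 1, 0, 0, 0]
# Toy reference age samples for the FH+ and FH- subgroups
print(weighted_auc(scores, ages, labels, [31, 36, 39], [28, 34, 42]))
```

When the reference and clinic age distributions coincide, all weights equal one and the weighted AUC reduces to the conventional AUC.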

We also estimated the age-specific ROC curves using a semi-parametric approach [31] and bootstrap confidence intervals.
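As a rough, fully nonparametric stand-in for the semi-parametric age-specific ROC approach of [31] (not the method used in the study), one can compute the AUC within a sliding age window of arbitrary width:

```python
def age_specific_auc(scores, ages, labels, age_grid, half_width=3.0):
    """Crude binned sketch of an age-specific AUC curve: at each grid age,
    compute the plain AUC among embryos whose maternal age falls within
    +/- half_width years (None where either outcome group is empty)."""
    out = []
    for a0 in age_grid:
        pos = [s for s, a, y in zip(scores, ages, labels)
               if y == 1 and abs(a - a0) <= half_width]
        neg = [s for s, a, y in zip(scores, ages, labels)
               if y == 0 and abs(a - a0) <= half_width]
        if not pos or not neg:
            out.append(None)
            continue
        wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
                   for sp in pos for sn in neg)
        out.append(wins / (len(pos) * len(neg)))
    return out

# Toy data: model scores, maternal ages, outcomes (1 = FH+, 0 = FH-)
scores = [9.0, 7.0, 8.0, 5.0, 6.0, 4.0]
ages = [30, 34, 38, 29, 35, 41]
labels = [1, 1, 1, 0, 0, 0]
print(age_specific_auc(scores, ages, labels, [30, 35, 40]))
```

With realistic sample sizes, smoothing (as in the semi-parametric approach) gives far more stable curves than this hard windowing.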

Software

All analyses were performed using R software [32] version 4.2.0, where we used the R packages WeightedROC [33] for estimation of weighted ROC curves and meta [34] for assessing between-clinic variation.

Results

Table 1 shows the basic descriptive characteristics of the 4805 transferred embryos from 4086 treatments in the external and the reference populations. The four clinics contributing to the external population have FH + success rates ranging from 25.9% to 48.7%. It is apparent from the additional characteristics in Table 1 that the clinics followed quite different protocols. The table shows that the maternal age distributions differed between the clinics, since the median ages were substantially different. This is of particular interest for our analysis, since we will be standardizing the AUCs according to the age distributions. There were also differences in the use of fresh vs. vitrified embryos as well as the choice of insemination method (IVF vs. ICSI).

Table 1 Descriptive statistics of the transferred embryos from the external clinics and the reference population

To assess, in more detail, how the age distributions differed, we show the estimated density functions for each external clinic and the reference population in Fig. 2. From this plot, it is apparent that the clinics differed in both the variability and the shape of the maternal age distributions. It is also evident that there was some lack of overlap in the lower tails of the distributions, in particular for clinic 2.

Fig. 2
figure 2

Clinic-specific maternal age distributions of the reference population and the external clinics for FH- and FH + embryos

The clinic-specific AUCs based on both the ROC and WROC methods along with their bootstrap CIs are shown by the squares and whiskers in the forest plot in Fig. 3. The \({\mathrm{AUC}}_{\mathrm{ROC}}\) estimates varied substantially from 0.58 to 0.69, although with quite wide CIs. Clinic 3 showed a clearly lower AUC than the other three, which had some overlap in the CIs. The heterogeneity of the \({\mathrm{AUC}}_{\mathrm{ROC}}\) estimates was very large with \({I}^{2}=0.90\).

Fig. 3
figure 3

Estimates of clinic-specific ROC (orange color) and WROC (blue color) AUCs with 95% CIs

The \({\mathrm{AUC}}_{\mathrm{WROC}}\) estimates resulting from age-standardizing to the reference population varied from 0.60 to 0.71. The \({\mathrm{AUC}}_{\mathrm{WROC}}\) estimates were all higher than the \({\mathrm{AUC}}_{\mathrm{ROC}}\) estimates except for one (clinic 1), where there was almost no difference. This was in accordance with the relatively similar age distributions observed for this clinic and the reference population in Fig. 2. Clinic 3 still showed a markedly lower age-standardized AUC than the other three clinics, which had more similar AUC values after standardization. The \({I}^{2}\) value for the \({\mathrm{AUC}}_{\mathrm{WROC}}\) estimates was 0.85, somewhat lower than the 0.90 for the \({\mathrm{AUC}}_{\mathrm{ROC}}\) estimates, and the estimated between-clinic variances were \({\widehat{\tau }}_{\mathrm{ROC}}^{2}=0.0026\) and \({\widehat{\tau }}_{\mathrm{WROC}}^{2}=0.0022\), representing a 16% decrease.

The mean of the clinic-specific \({\mathrm{AUC}}_{\mathrm{ROC}}\) estimates was 0.65 (95% CI: 0.61 to 0.68), while the mean of the \({\mathrm{AUC}}_{\mathrm{WROC}}\) estimates was 0.67 (95% CI: 0.63 to 0.71), which was equal to the conventional \({\mathrm{AUC}}_{\mathrm{ROC}}\) of the reference population up to two decimal places. This comparison was only possible because we have access to both model predictions and outcome data for the reference population in this study.

The relationship between model discrimination performance and maternal age within each clinic is illustrated in Fig. 4, which shows the age-specific AUC for each clinic. This figure shows that there was an association between maternal age and model performance but that maternal age affected model performance in quite different ways in the clinics, although the curves are subject to considerable uncertainty as reflected by the wide point-wise confidence intervals.

Fig. 4
figure 4

Estimated age-specific AUC for the external clinics. The solid line shows the estimated AUC values, while the dotted lines show the 95% confidence intervals

Discussion

Comparisons of the performance of an AI model between different clinics are prone to suffer from biases introduced by differences in patient characteristics and other clinic-specific conditions. However, the evaluation of a model’s performance in different settings constitutes an important task for validation purposes as well as for obtaining knowledge about the variability of the performance.

In the existing literature, it is not uncommon to see direct comparisons of conventional ROC AUC between clinics. However, as we have argued, this does not give a complete and unbiased picture of the complex question of how model performance varies between clinics.

In this study, we applied a method based on weighted ROC to age-standardize the performance estimates across several different clinics, thereby enabling a comparison that accounts for the inherent differences in maternal age between the clinics. Furthermore, we calculated the mean of the clinic-specific \({\mathrm{AUC}}_{\mathrm{WROC}}\) estimates as a summary measure of model performance across clinics. We found that the estimates of clinic-specific performance tended to vary less after applying standardization as compared to the conventional estimates, indicating that the standardization method reduces the between-clinic variability of the measured performance. It also turned out that in our data, the mean of the age-standardized AUC estimates was almost identical to the conventional AUC of the reference population, although this is not a result that could be expected to hold in general.

One obvious limitation of the approach we have used is that the adjusted value of each clinic-specific performance estimate depends on the particular age distribution in the reference population. On the other hand, this enables the comparison of performance in different settings whenever the age distribution is known in the comparative setting. In our case, all clinic-specific estimates increased as a result of the standardization, but this does not necessarily imply that the AUCs based on this approach are more meaningful than the unweighted estimates. However, it is arguably a fairer comparison of the model’s ability to rank embryos within treatment cycles where maternal age is constant, since it mitigates the effect of the maternal age differences between clinics. In fact, when we consider the age distributions and age-specific AUCs within clinics, it seems that for each clinic, the performance is lower in the age groups that are over-represented in that clinic compared to the reference.

Another limitation of the proposed method is that it relies on estimates of the age densities to account for the differences in age distributions, and, as such, the method might become unstable if there are maternal age ranges with low numbers of embryos. Finally, for this particular application of the method, the exclusion criteria that we have used in the study imply that our results do not necessarily apply to donor oocytes and to patients younger than 21 or older than 44 years.

The methods we propose in this paper are not restricted to the specific model (iDAScore) that we are evaluating. Numerous AI models, both commercial and academic, have been proposed for ranking of embryos. Some models are based solely on image data (single images or time-lapse), while others include other information relevant for implantation prediction. While the inclusion of patient characteristics, such as maternal age, in a model has the potential to improve the prediction of implantation likelihood, it might also hamper the generalizability of the model. Moreover, factors which are constant within a treatment do not have much potential for improving the ranking of embryos within the treatment. The iDAScore model is based on time-lapse images only, as this model is intended to give an objective and consistent evaluation and ranking of embryos within a treatment and not to predict the specific probability of implantation if the embryos are transferred.

The proposed method for comparing model performance between clinics is likewise not restricted to standardizing on age distributions. The standardization can be done on any relevant characteristic as long as it is not used as input to the model that is being evaluated, and non-continuous variables could easily be handled as well. In fact, the methods could even be extended to account for multiple variables simultaneously by calculating the weights based on a multivariate probability distribution.

Enabling objective comparison of performance estimates from different settings would require the establishment of a common reference age distribution that could then be used as a standard within the field. Such a reference could, for example, be based on IVF data registries like the ESHRE data group [35].

For an individual clinic, it is of potential interest to compare the performance of a model to the performance of the same model in other clinics. But in such a comparison, it is essential to consider and account for the differences in patient characteristics. Therefore, comparisons should be performed on standardized rather than conventional AUC values. Furthermore, performance evaluation in different clinics is an integrated part of developing a model, and standardized AUCs provide a tool to obtain even better and more generalizable predictive algorithms within the field of embryo evaluation.