1 Introduction

During the past decade, the number of exoplanet discoveries has increased exponentially, bringing the total number of confirmed exoplanets to more than 5000 by mid-2022. Numerous space missions are contributing to the effort of detecting new exoplanets, such as Kepler [1, 2], TESS [3], CHEOPS [4], PLATO [5], GAIA [6], together with ground instrumentation such as HARPS [7], WASP [8], KELT [9], and OGLE [10]. Over time, the field emphasis has gradually expanded from the determination of bulk planetary parameters to the search for a deeper understanding of the true nature of exoplanets and their formation-evolution histories.

Multiband photometry and spectroscopy of transiting exoplanets are currently the most promising techniques for characterizing the composition and thermodynamics of exoplanet atmospheres [11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30], as they allow us to effectively separate the signal of the planet from that of its host star. Observations in the near- to mid-infrared can probe the neutral atmospheres of exoplanets to study the signal from the rovibrational transitions of molecules [15, 31].

Current instrumentation has enabled this kind of atmospheric characterization only for a few tens of planets orbiting close to their host stars over a limited wavelength range [e.g. 17, 19, 32, 33]. A considerable contribution to exoplanetary science will come from the James Webb Space Telescope (JWST), launched in December 2021 [34], and Ariel. JWST provides broadband spectroscopy in the range of 0.6 to 28.5 micron of the electromagnetic spectrum, sufficient to detect all molecular species [31, 35,36,37,38,39].

1.1 Ariel and its Tiers

The Atmospheric Remote-Sensing Infrared Exoplanet Large-survey, Ariel, will launch in 2029 as the M4 ESA mission of the Cosmic Vision program [40, Ariel Definition Study ReportFootnote 1]. Ariel will conduct the first unbiased survey of a statistically significant sample of approximately 1000 transiting exoplanet atmospheres in the 0.5–7.8 \(\mu m\) wavelength range. Three photometers (VISPhot, 0.5–0.6 \(\mu m\); FGS1, 0.6–0.80 \(\mu m\); FGS2, 0.80–1.1 \(\mu m\)) and three spectrometers (NIRSpec, 1.1–1.95 \(\mu m\) and R \(\ge \) 15; AIRS-CH0, 1.95–3.9 \(\mu m\) and R \(\ge \) 100; AIRS-CH1, 3.9–7.8 \(\mu m\) and R \(\ge \) 30), provide simultaneous coverage of the whole spectral band. This broad spectral range encompasses the emission peak of hot and warm exoplanets and the spectral signatures of the main expected atmospheric gases such as H\(_2\)O, CO\(_2\), CH\(_4\), NH\(_3\), HCN, H\(_2\)S, TiO, VO [e.g. 15, 31]. Ariel will allow us to comprehensively understand the formation-evolution histories of exoplanets as well as to extend comparative planetology beyond the boundary of the Solar System.

After each observation, the resulting spectrum from each spectrometer is binned during data analysis to optimize the signal-to-noise ratio (S/N). Therefore, by implementing different binning options, the mission will adopt a four-Tier observation strategy, expected to produce spectra with different S/N to optimize the science return. Tier 1 is a shallow reconnaissance survey created to perform transit and eclipse spectroscopy on all targets to address questions for which a large population of objects needs to be observed. Tier 1 spectra have S/N \(\ge \) 7 when raw spectra are binned into a single spectral point in NIRSpec, two in AIRS-CH0, and one in AIRS-CH1, for a total of seven effective photometric data points. A subset of Tier 1 planets will be further observed to reach S/N \(\ge \) 7 at higher spectral resolution in Tier 2 and Tier 3 for detailed chemical and thermodynamic characterization of the atmosphere. Tier 4 is designed for bespoke or phase-curve observations [41].

1.2 Detecting molecules in Tier 1 spectra

Among the main goals of Tier 1 observations is to identify planetary spectra that show no molecular absorption features (because of clouds or compact atmospheres) and to select those to be reobserved in higher Tiers for a detailed characterization of their atmospheric composition and thermodynamics. Tier 1 observations, however, have a much richer information content even though the combination of S/N and spectral resolution might not be adequate to constrain chemical abundances with high confidence using retrieval techniques.

Adapting existing data analysis techniques or developing new methodologies can be essential to extract all relevant information from the Tier 1 data set. In a previous study, [42] were successful in demonstrating, using color-color diagrams, that Tier 1 observations can be used to infer the presence of molecules in the atmospheres of gaseous exoplanets, independently from planet parameters such as mass, size, and temperature. However, their method has an estimator bias that depends on the magnitude of the instrumental noise; a detailed characterization of instrumental uncertainties is required to remove the estimator bias before it can be used for quantitative predictions. In this follow-up paper, we develop a new method that is both reliable and unbiased to address the following question: can we use Tier 1 transmission spectra to identify the presence of a molecule, with an associated calibrated probability?. Hence, these calibrated probabilities can also be used to inform the decision-making process to select Tier 1 targets for re-observation in Ariel’s higher Tiers for detailed characterization.

Section 2 outlines the methodology used in this analysis. Section 2.1 describes our data analysis strategy for detecting a molecule in these spectra. Section 2.2 details our experimental data set, including the planetary population, forward model parameters, atmosphere randomization, and noise estimation. Section 2.3 summarizes the spectral retrievals performed, discussing the optimization algorithm and the priors used. Section 2.5 describes the data analysis tools used to evaluate the probability forecasts of the method. Section 3 details the results obtained in terms of forecast reliability (Section 3.1), predictive power (Section 3.2), and bias of the abundance estimator utilized (Section 3.3). Finally, Section 4 discusses all the results, and Section 5 summarizes the main conclusions of this analysis.

2 Methods

Tier 1 transmission spectra contain sufficient information to infer the presence of several atmospheric molecules [42], but Tier 1 observations are in general non-ideal for quantitative spectral retrievals in terms of molecular abundances, as they are required to achieve a S/N \(\ge 7\) when binned in only seven effective photometric data points in the 0.5-7.8 \(\mu m\) wavelength range [41]. Abundance posterior probabilities from retrievals can however still be informative and here we develop a new method to identify the presence of molecules in Tier 1 transmission spectra starting from these posteriors.

2.1 Analysis strategy

Given a marginalized posterior distribution of a molecular abundance, we compute an empirical probability, P, that the molecule is present in the atmosphere of a planet, with an abundance above some threshold, \(\mathbb {T}_{Ab}\), as:

$$\begin{aligned} P \simeq \int _{\mathbb {T}_{Ab}}^{\infty } \mathcal {P}(x) d x \end{aligned}$$
(1)

where \(\mathcal {P}\) is the marginalized posterior distribution and x represents the abundance values. Thus, the predicted P depends on the assumed atmospheric model and the selected abundance threshold \(\mathbb {T}_{Ab}\). If the assumed atmospheric model is representative of the observed atmosphere, then a clear correlation (above noise) between P and the true abundance in Tier 1 data implies that P can be used to identify the most likely spectra that contain a molecule, providing a preliminary classification of planets by their molecular content. Thus, this P-statistic can be considered robust [43], even when \(\mathcal {P}(x)\) is too broad to constrain the abundance.

To test whether this method is sensitive enough, we need to simulate transmission spectra as observed in Tier 1, using an atmospheric model that includes a certain number of molecules. Then, we need to perform a spectral retrieval with the same atmospheric model and compare each input molecular abundance with the predicted P corresponding to that molecule. The test is successful if, for an agreed \(\mathbb {T}_{Ab}\), we recover a high P for each large input abundance and a low P for each small input abundance. To understand how well the method behaves under conditions similar to the Ariel reconnaissance survey, we repeat this test on a large and diverse planetary population.

In this study, we employ a simulated population of approximately 300 transmission spectra of H\(_2\)-He gaseous planets, which contain CH\(_4\), H\(_2\)O, and CO\(_2\) trace gases with randomized input abundances. Additionally, we introduce NH\(_3\) with randomized abundances as a nuisance parameter since its spectral features overlap with those of water and other molecules. We utilize NH\(_3\) to test the P-statistic’s efficacy and investigate the robustness of its predictions under various assumptions, such as the exclusion of NH\(_3\) from retrievals or the inclusion of additional molecules not present in the population.

Therefore, we can study whether this method provides reliable predictions under less favorable conditions when the assumed model is not fully representative of the observed atmosphere. This might provide some insight into how robustly the method can reveal the presence of a molecule in a real observation when the atmosphere is unknown. For this, we add or remove molecules from the retrieval model (hereafter, “fit-composition”) with respect to the simulated composition. Then, we perform different spectral retrievals, that use different fit-compositions, and compare the predictions obtained from the P-statistic with the input abundances.

2.1.1 Model exploration

We consider three cases in our analysis. In the first case (referred to as \(\textrm{R}_0\)), we use an atmospheric model that includes CH\(_4\), H\(_2\)O, CO\(_2\), and NH\(_3\) as trace gases, which matches the composition used in the forward model generation of the population.

In the second case (referred to as \(\textrm{R}_1\)), we consider a fit-composition that includes only CH4, CO2, and H2O, omitting NH3. In this case, there is a possibility of inadequate representation of the data because NH\(_3\)’s molecular features could overlap with the observed features of other molecules (hence its adoption as a nuisance), particularly H\(_2\)O [31]. As a result, the retrieved values of P may not accurately reflect the input abundances of H\(_2\)O, leading to decreased reliability of the predictions.

In the third case (referred to as \(\textrm{R}_2\)), we expand the fit-composition beyond the input composition by including also CO, HCN, and H\(_2\)S. It should be noted that the spectral features of these additional molecules could also overlap with the observed features of the other molecules. For instance, CO and CO\(_2\) exhibit a spectral overlap around \(4.5\,\mu m\). Hence, even in this case, obtaining reliable predictions of the input composition may not be obvious.

Table 1 provides a summary of the molecules included in the fit-composition for each retrieval. For more detailed information on the retrievals performed, please refer to Section 2.3.

Table 1 Molecules included in the fit-composition for each retrieval

2.2 Experimental data set

As a simulated population, we use a planetary population generated using the Alfnoor software [42, 44]. Alfnoor is a wrapper of TauREx 3 [45] and ArielRad [46]. Given a list of candidate targets and a model of the Ariel payload, it automatically computes simulated exoplanet spectra as observed in each Ariel Tier.

Specifically, we use a subset of the POP-I planetary population of [42]. POP-I consists of 1000 planets from a possible realization of the Ariel Mission Reference Sample (MRS) of [41]. That MRS (hereafter, MRS19) comprises known planets in 2019 from NASA’s Exoplanet Archive and TESS forecast discoveries. Here we ignore the TESS forecasts, thus obtaining a sub-population of around 300 planets, that we label POP-Is. Using POP-Is planets ensures that, in principle, we can compare our results with those of [42].

Figure 1 shows that POP-Is comprises a diverse sample of planets mostly with large radii (\(\gtrsim \) 5 \(\textrm{R}_{\oplus }\)), short orbital periods (\(\le \) 4/5 days), warm to hot equilibrium temperatures (500 – 2500 \(^{\circ } K\)) and stellar hosts with different magnitudes in the K band of the infrared spectrum (8 – 12 \(m_K\)). Compared to the parameter space sampled by the entire POP-I, this data set has more occasional statistics on smaller and longer-period planets around brighter stars.

Fig. 1
figure 1

Parameter space distribution of the POP-Is planetary population used in this work, which comprises about 300 selected planets from MRS19. The horizontal axis reports the planetary orbital period in days; the vertical axis reports the stellar magnitude in the K band. Each data point represents a planet; the symbol size is proportional to the planetary radius in Earth’s radii; the symbol color shows the expected planetary equilibrium temperature. Light blue data points in the background show the entire MRS19/POP-I parameter space for reference

The detailed properties of POP-I (and therefore POP-Is) are discussed in [42] and briefly summarized here. The forward model parameters are randomized to test diverse planetary atmospheres. The baseline atmosphere is a primordial atmosphere filled with H\(_2\) and He with a solar mixing ratio of He/H\(_2\) = 0.17. The vertical structure of the atmosphere comprises 100 pressure layers, uniformly distributed in log space from 10\(^{-4}\) to 10\(^{6}\) Pa, using the plane-parallel approximation. The equilibrium temperature of each planet is randomized between \(0.7 \times T_{p}\) and \(1.05 \times T_{p}\), where \(T_{p}\) is the equilibrium temperature of the planet listed in MRS19; the atmospheric temperature-pressure profile is isothermal. Constant vertical chemical profiles are added for H\(_2\)O, CO\(_2\), CH\(_4\), and NH\(_3\), with abundances randomized according to a logarithmic uniform distribution spanning 10\(^{-7}\) to 10\(^{-2}\) in Vertical Mixing Ratios (VMR). Randomly generated opaque gray clouds are also added with a surface pressure varying from 5\(\times \)10\(^{2}\) to 10\(^{6}\) Pa to simulate cloudless to overcast atmospheres. Table 2 summarizes the randomized parameters of the POP-I forward models. For each planet, POP-I contains the raw spectrum binned at each Ariel Tier resolution (“noiseless spectra”), the associated noise predicted by the Ariel radiometric simulator, ArielRad, for each spectral bin, and the number of transit observations expected to reach the Tier-required S/N. To simulate an observation, we scatter the noiseless spectra according to a normal distribution with a standard deviation equal to the noise at each spectral bin. The “observed spectra” data set is built by repeating this process for each planet in POP-Is. As in [42], the Tier 1 data used in this work are binned on the higher resolution Tier 3 spectral grid: R = 20, 100, and 30, in NIRSpec, AIRS-CH0, and AIRS-CH1, respectively. The noise is that of Tier 1, which yields a S/N > 7 if data were binned on the Tier 1 spectral grid. This is to prevent the loss of spectral information that may occur in binning.

Table 2 Forward model randomized parameters in POP-I

2.3 Retrievals summary

To perform the retrievals, we use the TauREx 3 retrieval framework [45], the same used to generate the raw POP-Is spectra. In the retrieval model, we include opaque gray clouds, pressure-dependent molecular opacities of various trace gases, Rayleigh scattering, and Collision-Induced Absorption (CIA) of H\(_2\)-H\(_2\) and H\(_2\)-He. Table 3 reports a referenced list of CIA and all molecular opacities used in this study.

Table 3 List of opacities used in this work and their references

The free parameters of the retrievals are the radius and mass of the planet, as well as the molecular mixing ratios, as listed in Table 4. We use broad logarithmic uniform priors for the molecular abundances, ranging from 10\(^{-12}\) to 10\(^{-1}\) in VMR. For the mass and radius of the planet, we select uniform priors of 20\(\%\) and 10\(\%\) around the respective values listed in MRS19. The gray cloud pressure levels are not included as free parameters in the retrieval because of their degeneracy with other parameters such as the radius [60].

Table 4 Fit parameters and their priors for the retrievals

We set the evidence tolerance to 0.5 and sample the parameter space through 1500 live points using the Multinest algorithmFootnote 2 [61, 62]. We disable the search for multiple modes to obtain a single marginalized posterior distribution of each molecular abundance to insert in Eq. 1.

We then perform the three different retrievals (respectively R\(_0\), R\(_1\), and R\(_2\)) described in Section 2.1 on each POP-Is planet. We use the Atmospheric Detectability Index (ADI) [19] to assign statistical significance to the results of these retrievals. Given the Bayesian evidence of a nominal retrieval model, \(E_{N}\), and of a pure-cloud/no-atmosphere model, \(E_{F}\), the ADI is:

$$\begin{aligned} \text {ADI} = \ {\left\{ \begin{array}{ll} log(E_{N}) - log(E_{F}), \text {if } log(E_{N}) > log(E_{F}) \\ 0, \text {otherwise} \end{array}\right. } \end{aligned}$$
(2)

ADI is a positively defined metric, equivalent to the log-Bayesian factor [63, 64] where log(\(E_{N}\)) > log(\(E_{F}\)). To compute \(E_{F}\), we perform an additional retrieval for each planet with a flat-line model with the planet radius being the only free parameter.

2.4 Abundance threshold

We utilized the marginalized posteriors to estimate the P-statistic using an abundance threshold of \(\mathbb {T}_{Ab} = 10^{-5}\), which is considered “molecular-poor” according to the definition by [42]. This threshold is higher by 1-2 orders of magnitude compared to the Tier-2 detection limits reported by [44]. The “molecular-poor” condition is met for approximately 40% of the atmospheres due to the randomization boundaries set for each molecule (see Table 2). The ability to detect a molecule depends on factors such as opacities, correlations among molecules, and noise in the measured spectrum. Therefore, \(\mathbb {T}_{Ab}\) can be optimized for each molecule in future work, although we applied the same abundance threshold for all in this pilot study.

2.5 Data analysis tools

The P-statistic can be used to reliably classify planets for the presence of a molecule with an abundance above \(\mathbb {T}_{Ab}\) when P correlates with the Ab true value. The stronger the correlation above noise fluctuations, the larger the predictive power. Because this classification is binary and P is defined in the range \(0 \rightarrow 1\), we can use standard statistical tools such as calibration curves and ROC curves [65, 66] to evaluate the performance of this method in revealing the presence of molecules and in selecting Tier 1 targets for higher Tiers. These curves are routinely utilized by the Machine Learning communityFootnote 3, as they present the forecast quality of a binary classifier in a well-designed graphical format.

2.5.1 Calibration curves

A calibration curve [e.g. 66] plots the forecast probability averaged in different bins on the horizontal axis and the fraction of positives, in each bin, on the vertical axis (see Fig. 2 for a generic example). In this work, the fraction of positives is the fraction of POP-Is planets with true abundance larger than \(\mathbb {T}_{Ab}\), and the forecast probability is the corresponding P-statistic. Calibration curves provide an immediate visual diagnosis of the quality of binary classifier forecasts and the biases that the forecasts may exhibit.

Fig. 2
figure 2

Calibration curves of three mock classifiers, exhibiting different forecast quality and biases. The legend reports the B-S of the forecasts of each classifier. The calibration curve for perfectly calibrated forecasts is reported for reference

For well-calibrated predictions, the forecast probability is equal to the fraction of positives, except for deviations consistent with sampling variability. Therefore, the ideal calibration curve follows the 1:1 line. Miscalibrated forecasts can be biased differently depending on whether the calibration curve lies on the left or on the right of the 1:1 line. A curve entirely to the right of the 1:1 line indicates an over-forecasting bias, as the forecasts are consistently too large relative to the fraction of positives, as seen in the calibration curve of Classifier 1 in Fig. 2. On the contrary, the calibration curve of Classifier 2 shows the characteristic signature of under-forecasting, being entirely on the left of the 1:1 line, indicating that the forecasts are consistently too small relative to the fraction of positives. There may also be more subtle deficiencies in forecast performance, such as an under-confident forecast, with over-forecasting biases associated with lower probabilities and under-forecasting biases associated with higher probabilities, as seen in the calibration curve of Classifier 3.

Calibration curves paint a detailed picture of forecast performance, often summarized in a scalar metric known as the Brier Score [B-S, 68], which is defined as the mean square difference between probability forecasts and true class labels (positive or negative); the lower the B-S, the better the predictions are calibrated. From Fig. 2, we see that Classifier 3 achieves the best B-S, although the forecasts are not well calibrated. In general, uncalibrated forecasts can be calibrated using calibration methods such as Platt scaling and Isotonic regression [69,70,71].

Table 5 Contingency table formulating all four possible outcomes of a binary classification problem

2.5.2 ROC curves

Given the predicted probabilities of a classifier, and a selected probability threshold \(\mathbb {P}\), the number of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN), are defined in Table 5.

A binary classifier with high predictive power assigns larger P to positive observations (true label “Yes") and smaller P to negative (true label “No"). This maximizes TP and TN, and minimizes FP and FN.

Fig. 3
figure 3

ROC curves of the same mock classifiers shown in Fig. 2, exhibiting different predictive powers. The legend reports the AUC associated with each ROC curve. The ideal and worst possible classifier ROC curves are reported for reference. Several probability thresholds \(\mathbb {P}\) at regularly spaced intervals are also displayed on each curve

A ROC curve [e.g. 66] is a square diagram that illustrates the predictive power at different values of the probability threshold \(\mathbb {P}\). It plots the False Positive Rate (FPR) on the horizontal axis and the True Positive Rate (TPR) on the vertical axis (see Fig. 3 for a generic example), defined as:

$$\begin{aligned} \text {FPR} = \frac{\textrm {FP}}{\text {Negatives}} = \frac{\textrm {FP}}{\text {FP} + \textrm {TN}} \end{aligned}$$
(3a)
$$\begin{aligned} \text {TPR} = \frac{\text {TP}}{\textrm {Positives}} = \frac{\text {TP}}{\textrm {TP} + \textrm {FN}} \end{aligned}$$
(3b)

FPR and TPR are commonly known as “false alarm” and “hit” rates. ROC curves are constructed by calculating the TPR and FPR from the number of TP, TN, FP, and FN as \(\mathbb {P}\) decreases from 1 to 0. The ideal classifier minimizes the FPR while maximizing the TPR; thus, its ROC curve is the unit step function. On the other hand, the worst possible classifier is a random classifier with a ROC curve along the 1:1 line. Real-world classifiers have intermediate ROC curves ranked by how close they are to the unit step function. As seen in Fig. 3, Classifier 3 exhibits the highest predictive power, as the corresponding ROC curve arcs everywhere above the ROC curves for Classifiers 1 and 2.

ROC curves portray a detailed picture of predictive power, often summarized in a scalar metric known as the Area Under the Curve (AUC), the fraction of the unit square area subtended by a ROC curve. The higher the AUC, the higher the predictive power. The ideal classifier has \(\text {AUC} = 1.0\); the random one has \(\text {AUC} = 0.5\). From Fig. 3, we see that, as expected, Classifier 3 also achieves the largest AUC.

ROC curves can also be used to select the optimal classification threshold \(\mathbb {P}\), which roughly corresponds to the position on the curve where the TPR cannot be raised without significantly increasing the FPR. For example, as seen in Fig. 3, the optimal \(\mathbb {P}\) for Classifier 3 is around 0.5, where it achieves a TPR of nearly 0.9 at a low FPR of approximately 0.1. Reducing \(\mathbb {P}\) to 0.4 is not advantageous, as it only increases the TPR to approximately 0.95, at the expense of increasing the FPR to almost 0.3.

2.6 Using calibration and ROC curves

Using calibration curves and the B-S metric, we can immediately diagnose the forecast quality of the P-statistic and its potential biases. Suppose that the forecast probability P matches the fraction of planets with input abundances greater than \(\mathbb {T}_{Ab}\) (fraction of positives) in each probability bin. In that case, the prediction of the method is well-calibrated. Moreover, we can compare the forecast quality achieved for different molecules using the B-S metric. If the forecasts are not well calibrated, we can infer which kind of bias affects the predictions of the method by inspecting the shape of the calibration curve. If the forecasts show an over-forecasting bias (as in the example of Classifier 1, Fig. 2) and therefore incorrectly classify a fraction of planets as bearing a molecule, too many Tier 1 planets may be selected for re-observation in higher Tiers, resulting in less optimal scheduling of observations. On the contrary, an under-forecasting bias (as in the example of Classifier 2, Fig. 2) may imply that fewer Tier 1 planets than possible would be scheduled for re-observing in higher Tiers.

Using ROC curves and the AUC metric, the power of the P-statistic to predict the presence of molecules can be assessed. The closer the ROC curve approaches the unit step function (AUC \(\simeq 1\), Fig. 3), the higher the predictive power. Moreover, we can directly compare the predictive power achieved for different molecules by analyzing the shape of the corresponding ROC curves and the AUC values.

The shape of the ROC curve provides a way to select the optimal classification threshold, \(\mathbb {P}_{*}\), for the problem under study. For instance, \(\mathbb {P}_{*}\) can be chosen in a trade-off process that maximizes the TPR while keeping the FPR at an acceptable low value.

This choice can aid the selection of Tier 1 targets for re-observation in a higher Tier: a large FPR would result in a poor allocation of observing time while a low TPR would result in a reduction of observational opportunities. It can also benefit population studies where one might need to track the presence of certain molecules across families of planets and extrasolar systems. These types of studies are outside the scope of this work, but can profit from the methodology developed here.

Fig. 4
figure 4

Detection reliability analysis for CH\(_4\), H\(_2\)O, and CO\(_2\) from the \(\textrm{R}_0\) retrievals, that implement a model that is fully representative of the simulated atmospheres. All plots in the same column share the same horizontal axis with the predicted probabilities, \(P(Ab_{mol}\) > \(10^{-5})\), that a molecule is present in the atmosphere of a planet, with an abundance above the selected abundance threshold, \(\mathbb {T}_{Ab} = 10^{-5}\). Top row: histogram with the frequency of the P forecasts. Middle row: diagrams showing the correlation between P values on the horizontal axis and input abundances on the vertical axis. The linear fit parameters of the data points are reported on each legend. For visual reference, the dotted horizontal lines show the position of \(\mathbb {T}_{Ab}\) and the dotted vertical lines the value 0.5 on the x-axis. Bottom row: calibration curves with associated bootstrap confidence intervals; each legend shows the B-S of the forecasts

3 Results

As detailed in Section 2.1, we designed a method based on the P-statistic to reveal the presence of a molecule in Tier 1 spectra. In the following sections, we use the statistical tools described in Section 2.5 to show the performance of the P-statistic in predicting the presence of several molecules in our simulated planetary population. In particular, in Section 3.1, we use calibration curves to assess the reliability of the predictions of the method and related biases, while in Section 3.2, we use ROC curves to assess the predictive power of the method and discuss the optimal classification threshold, \(\mathbb {P}_{*}\). In Section 3.3, we use the median abundance as an estimator of the true abundance and investigate its biases in the low S/N regime to explain the biases observed in the calibration curves.

3.1 Detection reliability

3.1.1 Retrieval \(\textrm{R}_0\)

Figure 4 shows the analysis performed to evaluate the reliability of the method when using the abundance posteriors of the retrieval \(\textrm{R}_0\), which uses the same atmospheric composition as the one used in the generation of the simulated atmospheres (see Table 1). The subplots in each column share the same horizontal axis with the predicted probability P that a molecule is present with an input abundance, \(Ab_{mol}\), above the selected abundance threshold \(\mathbb {T}_{Ab} = 10^{-5}\) (see Section 2.4). The figure reports the results for CH\(_4\), H\(_2\)O, and CO\(_2\), shown from left to right, respectively.

The top row displays histograms of the P-statistic realizations, which exhibit a bimodal distribution. Two peaks are observed in the distribution, with one located at \(P \approx 0.2\) and the other at \(P \approx 0.8\), with the former being more prominent. Additionally, a valley is observed at intermediate values, with \(P \approx 0.5\).

The middle row shows the correlation between the predicted probabilities on the horizontal axis and the input abundances of each molecule on the vertical axis. We take a rough measure of the correlation by calculating the angular coefficient of the data points from a linear fit. These coefficients are listed in Table 6. The lower right quadrant of these diagrams (\(P \gtrsim 0.5\) and \(Ab_{mol} < 10^{-5}\)) is almost empty of data points, indicating that whenever the method predicts a high P, the corresponding input abundance is likely higher than \(\mathbb {T}_{Ab}\). However, not all planets with an input abundance greater than \(\mathbb {T}_{Ab}\) are associated with a high P, as the upper left quadrants of these diagrams (\(P\lesssim 0.5\) and \(Ab_{mol} > 10^{-5}\)) are not empty of data points.

The bottom row shows the calibration curves computed for each molecule; each curve is shown with a bootstrap confidence interval calculated using 1000 bootstrap samples. That is, following [72], we randomly remove \(\sim 1/e \approx 36\%\) of the data from each of these samples and replace them by repeating some randomly chosen instances of the ones kept. For each molecule, we calculate the B-S using the \(\texttt {brier\_score\_loss}\) method of \(\texttt {sklearn.metrics}\) [67], with the associated uncertainty estimated from the same bootstrap samples. Table 6 lists the B-S values obtained.

Table 6 Best-fit value for the angular coefficient m from the linear fit \(\log (Ab_{mol}) \propto m \ P(Ab_{mol} > \mathbb {T}_{Ab})\), with \(\mathbb {T}_{Ab} = 10^{-5}\), and Brier Score for the calibration curves for all possible combinations of retrievals and molecules

The calibration curves show an under-forecasting bias (curve to the left of the 1:1 line; see Section 2.5.1) especially associated with larger forecast probabilities, giving a fraction of positives \(\approx 1.0\) for \(P \gtrsim 0.6\). On the contrary, the probabilities are better calibrated for \(P \lesssim 0.4\). From the B-S values (less accurate forecasts receive higher B-S), we see that CH\(_4\) is the best-scoring molecule, probably due to its strong absorption spectral features.

It is possible that the observed under-forecasting of the calibration curves and the bimodality of the P-statistic distribution are both related to the sampling of the parameter space. This is briefly discussed further in Section 4.2.

3.1.2 Retrieval \(\textrm{R}_1\)

Figure 5 shows the same analysis for the retrieval \(\textrm{R}_1\), which includes only CH\(_4\), CO\(_2\), and H\(_2\)O in the fit-composition and excludes NH\(_3\), although this molecule is present in the data set (see Table 1).

Fig. 5
figure 5

Same as Fig. 4. Detection reliability for the \(\textrm{R}_1\) retrievals, implementing a model that excludes NH\(_3\) from the fit-composition

Comparing the histograms from the top row of this figure with those obtained for the retrieval \(\textrm{R}_0\) (Fig. 4), we notice a decrease in the forecast frequency at low P, especially for CH\(_4\) and H\(_2\)O, with a reduced peak at P around 0.2. On the contrary, high values of P are more frequent, enhancing the peak at P around 0.8: for CH\(_4\), more than \(30 \%\) of the data set receives P between 0.8 and 0.9. These are samples with high input abundance.

The plots in the middle row show an increase in the scatter in the data points compared to \(\textrm{R}_0\). In this case, we find a decrease in the correlation between P and the input abundances, and the angular coefficients of the linear fit are reported in Table 6. Planets that receive \(P \gtrsim 0.8\) have high input abundance, \(Ab_{mol} > 10^{-5}\).

The calibration curves for H\(_2\)O and CH\(_4\) in the bottom row are, within the uncertainties, closer to the 1:1 line than for \(\textrm{R}_0\), both for high and low forecast probabilities. Although this might appear closer to the ideal behavior, it could be misleading. The B-S is higher than for \(\textrm{R}_0\), because the mean squared difference between the forecasts and true class labels is larger. This is visualized in the middle plots: for \(Ab_{mol} < 10^{-5}\) (negative true class label), there are many forecast values with \(P > 0.5\). In other words, the correlation between the P-statistic and the true input abundances is weaker. In contrast, the entire CO\(_2\) calibration curve shows the signature of under-forecasting. The curve for CO\(_2\) is almost the same as for \(\textrm{R}_0\), likely because the missing NH\(_3\) affects less the CO\(_2\) abundance posteriors. On the other hand, the overlap of NH\(_3\) with H\(_2\)O but also CH\(_4\) makes the model used in the retrieval less suitable to describe the data.

The reduced correlation between probability forecasts and input abundances, as well as the higher B-S values, suggest that excluding NH\(_3\), despite its presence in the data set, leads to less representative abundance posteriors. However, predictions for CO\(_2\) are less affected, possibly because this trace gas has less spectral overlap with NH\(_3\) compared to H\(_2\)O or CH\(_4\).

3.1.3 Retrieval \(\textrm{R}_2\)

The results of the same analysis for the retrieval \(\textrm{R}_2\), which includes CO, HCN, and H\(_2\)S as additional molecules to the fit-composition (see Table 1) are very similar to those of \(\textrm{R}_0\) (see Section 3.1.1). Therefore, we refer the reader to Table 6 that summarizes the results for the correlation between predicted probabilities and input abundances, along with the B-S values, and to Fig. 13 in Section A of the Appendix.

3.2 Predictor assessment

3.2.1 Retrieval \(\textrm{R}_0\)

Figure 6 shows the analysis performed to assess the predictive power of the P-statistic (ability to maximize TP and TN while minimizing FP and FN) when using the abundance posteriors from the retrieval \(\textrm{R}_0\). The figure reports the results for CH\(_4\), H\(_2\)O, and CO\(_2\), shown in different columns from left to right, respectively.

Fig. 6
figure 6

Predictor assessment analysis for CH\(_4\), H\(_2\)O, and CO\(_2\) from the \(\textrm{R}_0\) retrievals, that implement a model that is fully representative of the simulated atmospheres. Top row: ROC curves with associated bootstrap confidence intervals. The ideal and worst possible classifier ROC curves are reported for reference. The legends report the AUC associated with each ROC curve. Several probability thresholds \(\mathbb {P}\) at regularly spaced intervals are also displayed on each curve. Bottom row: TP, TN, FP, and FN curves plotted as a function of the probability threshold \(\mathbb {P}\), with confidence intervals from the same bootstrap estimation

The upper row shows the calculated ROC curves for each molecule. Each curve is reported with a bootstrap confidence interval calculated using 1000 bootstrap samples, with the same random removal and replacement of the data as discussed in Section 3.1, involving \(1/e \approx 36\%\) of the data. For each molecule, we calculate the AUC using the \(\texttt {roc\_auc\_score}\) method of \(\texttt {sklearn.metrics}\) [67], with the associated uncertainty estimated from the same bootstrap samples. The AUC values thus obtained are collected in Table 7. For all molecules, the ROC curves are close to ideal behavior (curve near the unit step function, see Section 2.5.2), showcasing that the P-statistic has significant predictive power. Consequently, the corresponding AUC values are \(> 0.9\), with no considerable variation between molecules, implying similar predictive power.

Table 7 AUC of the ROC curves and probability odds at the probability threshold \(P = 0.5\) for all possible combinations of retrievals and molecules

For each molecule, the bottom row shows the number of TP, TN, FP, and FN (see Table 5), used to construct the ROC, versus the probability threshold \(\mathbb {P}\). Also shown are the associated confidence intervals estimated from the same bootstrap samples. These diagrams provide information on how the predictive power of the method changes as \(\mathbb {P}\) varies from 1 to 0 and aid in the selection of the optimal classification threshold \(\mathbb {P}_{*}\) (see Section 2.6).

Given the randomization of trace gas abundances in the forward model (\(10^{-7}\) to \(10^{-2}\) on a uniform logarithmic scale, see Table 2), and the selected abundance threshold (\(\mathbb {T}_{Ab} = 10^{-5}\)), the data set contains \(\sim 60 \%\) positive observations and \(\sim 40 \%\) negative observations. By definition, for \(\mathbb {P} = 1\), the number of positive forecasts, \(\mathrm {N}_{\mathrm{P}} = \mathrm{TP} + \mathrm{FP}\), is zero, and the number of negative forecasts, \(\mathrm {N}_{\mathrm{N}} = \mathrm {TN} + \mathrm {FN}\), is equal to the size of the data set. Therefore, at this probability threshold, \(\textrm{TN} \simeq 40 \%\) and \(\textrm{FN} \simeq 60 \%\). As \(\mathbb {P}\) decreases, \(\mathrm {N}_{\mathrm{P}}\) increases (TP and FP increase), while \(\mathrm {N}_{\mathrm{N}}\) decreases (TN and FN decrease). For \(\mathbb {P} = 0\), \(\mathrm {N}_{\mathrm{N}}\) is zero and \(\mathrm {N}_{\mathrm{P}}\) is equal to the data set size; at this classification threshold, \(\textrm{TP} \simeq 60 \%\) and \(\textrm{FP} \simeq 40 \%\).

In those cases where there are no external constraints on which misclassification is more bearable (FP or FN), the intersection of their curves gives an optimized classification threshold \(\mathbb {P}_{*}\).

From this intersection, we obtain \(\mathbb {P}_{*} \approx 0.3\) for all molecules. For confirmation, we can trace this \(\mathbb {P}_{*}\) on the ROC curves. As expected, it roughly corresponds to the point where we cannot significantly increase TPR without increasing FPR, which is at TPR \( \approx 0.8\). If, instead, we need a more conservative number of FP, we can choose a higher \(\mathbb {P}_{*}\), for example \(\mathbb {P}_{*} = 0.5\), the default classification threshold for a binary classifier.

A concise way to demonstrate the effectiveness of the P-statistic in rejecting misclassifications is by computing the odds TP:FP and TN:FN, estimated from the curves in the bottom row of Fig. 6. Odds relate to the probability that a molecule is correctly identified at the selected \(\mathbb {P}\), with an example shown in Table 7, estimated at \(\mathbb {P}_{*} = 0.5\). The table shows that the P-statistic is quite effective in rejecting FP, as they are negligible for all molecules at this threshold. Moreover, TPR at \(\mathbb {P}_{*} = 0.5\) indicates that more than 60% of the positives in the dataset is correctly identified, with TP values of approximately 45%, 35%, and 45% for CH\(_4\), H\(_2\)O, and CO\(_2\), respectively (rounded to the nearest 5% from the odds values listed in the table). However, at this \(\mathbb {P}\), FN increases to approximately 15-25% of the dataset (as seen in the bottom row of Fig. 6 at \(\mathbb {P}_{*} = 0.5\)), resulting in TN:FN odds of less than 3:1.

3.2.2 Retrieval \(\textrm{R}_1\)

Figure 7 shows the same analysis for the retrieval \(\textrm{R}_1\).

Fig. 7
figure 7

Same as Fig. 6. Predictor assessment for the \(\textrm{R}_1\) retrievals, implementing a model that excludes NH\(_3\) from the fit-composition

Comparing the ROC curves in the top row with those obtained for the retrieval \(\textrm{R}_0\) (see Section 3.2.1), we notice a decrease in the predictive power of the method, measured by a reduction in AUC for CH\(_4\) and H\(_2\)O, as reported in Table 7. On the contrary, the CO\(_2\) ROC achieves the highest AUC, similar to that of \(\textrm{R}_0\), possibly caused by the limited overlap between NH\(_3\) and CO\(_2\), when compared to the case of CH\(_4\) and H\(_2\)O.

The plots in the bottom row show a significant reduction in the performance of the FP curve compared to that achieved for \(\textrm{R}_0\): for CH\(_4\) and H\(_2\)O, it is above \(10 \%\) up to \(\mathbb {P} \simeq 0.6\), instead of \(<1 \%\) at \(\mathbb {P} \simeq 0.5\). The TN curve also shows a decrease in performance: it remains below \(30 \%\) to \(\mathbb {P} \simeq 0.6\), instead of reaching \(40 \%\) at \(\mathbb {P} \simeq 0.4\) in \(\textrm{R}_0\). Although the TP and FN curves demonstrate relatively better performance, the optimal classification threshold denoted as \(\mathbb {P}_{*}\), determined at the intersection of the FP and FN curves, increases to approximately \(\mathbb {P}_* \sim 0.65, 0.5, 0.4\) for CH\(_4\), H\(_2\)O, and CO\(_2\), respectively. Tracing these \(\mathbb {P}{*}\) values on the ROC curves reveals that they correspond to a TPR of approximately 0.8 for all molecules, similar to \(\textrm{R}_0\), but with a significantly worse FPR, as a consequence of the reduced predictive power.

Table 7 reflects this, showing the odds of TP:FP and TN:FN at the same probability threshold \(\mathbb {P}_{*} = 0.5\), which was used for \(\textrm{R}_0\). In this case, the method is less efficient in rejecting FP, despite having TP of approximately 50% and 45% for CH\(_4\) and H\(_2\)O, respectively, resulting in only about 3:1 odds for TP:FP. However, the method is still effective in correctly identifying planets with CO\(_2\), with TP:FP odds of about 9:1. As for TN:FN, the results are similar to \(\textrm{R}_0\), with a slightly better rejection of FN in the case of CH\(_4\) (4:1 instead of 3:1).

3.2.3 Retrieval \(\textrm{R}_2\)

The results from the same analysis for the retrieval \(\textrm{R}_2\) are very similar to \(\textrm{R}_0\)’s (see Section 3.2.1). Therefore, we refer the reader to Table 7 that summarizes the AUC values obtained and the odds TP:FP and TN:FN at the probability threshold \(\mathbb {P}_{*} = 0.5\), and to Fig. 14 in Section A of the Appendix.

3.3 Abundance estimates

Tier 1 might not be adequate for reliable abundance retrieval, for which higher Ariel Tiers are better suited. Therefore, we study the retrieved Tier 1 abundances to investigate trends in their distribution that may clarify some of the behavior observed in the calibration and ROC curves seen in the previous sections. The abundance estimator used is obtained from the median of the marginalized posterior distribution of the \(\log Ab_{mol}\) with asymmetric error bars estimated from the \(68.3 \%\) confidence level around the median. In particular, we are interested in investigating the regime of input abundances under which this median-based estimator is unbiased.

3.3.1 Retrieval \(\textrm{R}_0\)

Figure 8 reports the analysis performed to investigate potential biases affecting the median of the marginalized posteriors when used as an estimator of the log-abundances. The figure reports the results for CH\(_4\), H\(_2\)O, and CO\(_2\), shown in different columns from left to right, respectively. NH\(_3\) exhibits similar behavior to the other three molecules, but it is not included in the figure in line with the decision to treat it as a nuisance in this study.

Fig. 8
figure 8

Comparison between the retrieved molecular abundances and their true values is shown from the \(\textrm{R}_0\) retrievals. The estimator for the retrieved log-abundances is the median of the posterior distributions from the retrievals. Top row: retrieved vs. input molecular abundances. The solid black line represents the ideal trend, and the color bar visualizes the distance between input and retrieved abundances in units of uncertainty \(\sigma \). The symbol size is proportional to the S/N in the AIRS-CH0 spectroscopic channel. Middle row: log-abundance S/N vs. the difference between the retrieved and input log-abundances. A black dashed line is drawn at a value of 5 on the vertical axis for visual reference. Bottom row: true abundances vs. the difference between the retrieved and true log-abundances, in units of \(\sigma \). Dashed vertical lines are drawn at 3 and 5-\(\sigma \). Text boxes show the number of 2-, 3-, and 5-\(\sigma \) outliers

Panels in the top row show the molecular log-abundance input vs. the retrieved with the error bar. A solid black line serves as the ideal trend (1:1 line) for visual reference. The color bar indicates the distances between the input and retrieved log-abundance, expressed in units of the uncertainty \(\sigma \) on \(\log Ab_{mol}\), estimated by averaging the asymmetric error bars. Blue colors denote distances up to \(1 \sigma \); red colors represent distances in the range of \(1 \rightarrow 2 \sigma \). Larger distances are marked with black circles, which serve to diagnose potential trends and biases that may affect the retrieval results. In addition, the symbol size reflects the signal-to-noise ratio (S/N) of each observation as estimated in the AIRS-CH0 spectroscopic channel, providing insight into possible trends between the distance to the input abundance and the S/N condition.

The retrieved abundances exhibit good agreement with the input abundances in the large abundance regime, characterized by limited scatter around the ideal trend and by low retrieved uncertainties. This regime is generally observed for \(Ab_{mol} \gtrsim 10^{-4}\), but starts to break down at \(10^{-5} \lesssim Ab_{mol} \lesssim 10^{-4}\). For \(Ab_{mol} \lesssim 10^{-5}\), the input abundances are rarely retrieved accurately. This analysis can provide insights into the detection limits of CH\(_4\), H\(_2\)O, and CO\(_2\) in Ariel Tier 1, which are estimated to be around \(10^{-4}\). These values can be compared with the expected detection limits of the same molecules in Ariel Tier 2, which are anticipated to be significantly lower, with previous studies [44] reporting limits between \(10^{-7}\) and \(10^{-6.5}\).

Let the log-abundance S/N be defined as \(\frac{1}{\sigma }\mid \log {Ab_{mol}}\mid \), where \(Ab_{mol}\) is the true value of the molecular abundance. The middle row panels in Fig. 8 show the plot of log-abundance S/N vs. the difference between the retrieved and input log abundances. It can be observed that the distribution of data points is broadly separated into two sub-populations at a S/N of about 5. Data points with high S/N correspond to cases where the input is confidently retrieved and aligned along the 1:1 line in the upper row diagrams, indicating unbiased estimation. On the other hand, data points with low S/N cluster in the bottom left portion of the diagram. In these cases, the median is no longer an unbiased estimator of the true value, as the corresponding data points lie to the left of the 1:1 line in the upper row diagrams. As discussed further in Section 4.2, these cases have posteriors dominated by the prior imposed in the retrieval and are best treated as upper limits.

In the bottom row of Fig. 8, the true abundances are shown vs. the difference between the retrieved and true abundances, in units of \(\sigma \). The diagrams provide a visualization of how many samples are 2-, 3-, and 5-\(\sigma \) outliers, allowing verification that the distribution is compatible with the tail of the abundance posteriors. The number of outliers is shown in the text box inserted in the diagrams and (converted into percentages) in Table 8. Assuming that the abundance posteriors are representative of the data, the fraction of expected outliers outside is \(5\%\), \(0.3\%\), and \(\ll 1 \%\), respectively at 2-, 3-, and 5-\(\sigma \). We find good agreement between the percentages reported in Table 8 and these values, with minor deviations compatible with the statistical fluctuations of a random variable.

Table 8 Percentage of data points counted outside three confidence intervals for all possible combinations of retrievals and molecules

3.3.2 Retrieval \(\textrm{R}_1\)

Figure 9 shows the same analysis for the retrieval \(\textrm{R}_1\).

Fig. 9
figure 9

Same as Fig. 8 for the \(\textrm{R}_1\) retrievals, implementing a model that excludes NH\(_3\) from the fit-composition

The top row shows that, although there is still a correlation between the retrieved and input abundances, it is less significant than for \(\textrm{R}_0\). Furthermore, comparing the retrieved and input abundances yields different regimes for each molecule. However, the main difference from \(\textrm{R}_0\) is the significant number of data points at distances greater than \(2 \sigma \) (marked by black circles), corresponding to 2-\(\sigma \) outliers. In particular, for all molecules, most of these points are located to the right of the ideal trend, indicating the presence of an overestimation bias for the retrieved abundances. These data points are located in the region y \(\gtrsim 5\) and x > 0 in the plots in the middle row. Therefore, in addition to the overestimation bias for the abundances, their retrieved uncertainties are underestimated. Furthermore, the bottom-row diagrams show a larger number of outliers compared to the \(\textrm{R}_0\) case: too many for the posterior to be considered representative. This is a consequence of an atmospheric model which is not representative of the data, biasing the likelihood, the abundance posteriors, and the median estimator of the abundances.

3.3.3 Retrieval \(\textrm{R}_2\)

The results of the same analysis for the retrieval \(\textrm{R}_2\) are very similar to those of \(\textrm{R}_0\), including the number of outliers that are compatible with the expectations for a model that is representative of the data. Therefore, we refer the reader to Table 8, and to Fig. 15 in Section A of the Appendix. Here, we only stress that adding molecules to the fit-composition that are not present in the data set does not appear to significantly bias the abundance posteriors, compared to \(\textrm{R}_0\). This is further discussed in Section 4.2.

4 Discussion

In this section, we first discuss the similarities between the results from the retrievals \(\textrm{R}_0\) and \(\textrm{R}_2\), shown in Sections 3.1 and 3.2. Then we apply the ADI metric to compare all retrievals from the point of view of the Bayesian evidence (Section 4.1). Finally, we expand the discussion to the role of the priors in the retrieved abundance posteriors (Section 4.2).

The results of Sections 3.1 and 3.2 show that the predictions of the P-statistic for the retrievals \(\textrm{R}_0\) and \(\textrm{R}_2\) are comparable, despite the quite different fit-compositions, while the reliability of the P-statistic is lower in the \(\textrm{R}_1\) case. The \(\textrm{R}_0\) model and its parameters are identical to those used to generate the POP-Is population, and the \(\textrm{R}_2\) extends the parameter space with new molecules. In \(\textrm{R}_2\), the abundance posteriors for CH\(_4\), H\(_2\)O, and CO\(_2\) do not appear to be significantly affected by the addition of CO, HCN, and H\(_2\)S in \(\textrm{R}_2\), despite that the latter three spectral signatures partially overlap with those of CH\(_4\), H\(_2\)O, and CO\(_2\) [31]. It should be noted that the absence of the three molecules from the simulated atmospheres is correctly revealed in \(\textrm{R}_2\) by their low P-statistic, shown in Fig. 10, that take values smaller than \(40 \%\) for CO, HCN, and H\(_2\)S, respectively. The extension of the analysis to include the calibration and ROC curves to these molecules is left to future work.

The analysis, therefore, suggests that the P-statistic is robust (that means, provides reliable results) against retrieval models that are over-representative of the observed atmosphere. However, the P-statistic can no longer be considered robust when the retrieval models are under-representative of the observed atmosphere.

Fig. 10
figure 10

Histogram of the frequency of use of each possible P forecast for CO, HCN, and H\(_2\)S, using the abundance posteriors from the retrieval \(\textrm{R}_2\). The dotted vertical line marks the default binary classification threshold \(P = 0.5\) for reference

In the current study, the threshold abundance used to estimate the P-statistic remains constant for all molecules. While it is possible to optimize this threshold for individual molecules, we leave this aspect for future research as discussed in Section 2.4. Lowering the threshold reduces the information provided by the ROC curves. To achieve the optimal point of operation, one must balance the True and False Positive Rates, which is necessary to promote a Tier-1 target to higher Tiers. It is important to note that ROC curves calculated at different threshold levels provide a statistical estimation of the sample’s completeness, enabling the inference of population-wide properties such as the fraction of planets containing certain molecules. While this aspect requires further investigation in future research, it should be noted that the fraction of positive, \(\Sigma \) (planets with true abundance in excess of \(\mathbb {T}_{Ab}\)) is related to the fraction of Tier-1 targets, \(\tilde{\Sigma }\), selected with \(P(>\mathbb {T}_{Ab}) > \mathbb {P}\) by

$$\begin{aligned} \Sigma = \frac{\tilde{\Sigma } - FPR}{TPR - FPR} . \end{aligned}$$

The similarities between the \(\textrm{R}_0\) and \(\textrm{R}_2\) models are further discussed in the next section.

4.1 ADI comparison

The ADI metric, described in Section 2.3, is used to assess the statistical significance of a model atmosphere with respect to a featureless spectrum using the log-Bayesian factor. A large ADI suggests that a featureless spectrum is less favored by the data. From the ADI definition, the log-Bayesian factor of two competing models is the difference between their respective ADI.

Figure 11 shows the ADI differences between the \(\textrm{R}_0\) model and the two competing models, \(\textrm{R}_1\) and \(\textrm{R}_2\), plotted against NH\(_3\) abundances. A large, positive difference indicates that the competing models are less representative of the data compared to \(\textrm{R}_0\). The median ADI values for all retrievals are approximately 91, 86, and 92 for \(\textrm{R}_0\), \(\textrm{R}_1\), and \(\textrm{R}_2\), respectively, as shown in the text box within Fig. 11. This suggests that a featureless atmospheric model is not favored by the data, and \(\textrm{R}_1\) is the least representative, as expected. This is further supported by the fact that the ADI difference between \(\textrm{R}_0\) and \(\textrm{R}_1\) increases with increasing NH\(_3\) abundance, indicating that higher NH\(_3\) abundances make \(\textrm{R}_1\) less representative compared to \(\textrm{R}_0\), in agreement with the analysis of Section 3. In contrast, the ADI difference between \(\textrm{R}_0\) and \(\textrm{R}_2\) is close to zero, with a scatter described by a standard deviation of approximately 0.5, which is independent of NH\(_3\) abundance. This confirms that \(\textrm{R}_2\) is similarly representative of the data compared to \(\textrm{R}_0\), despite describing a wider parameter space.

Fig. 11
figure 11

Bayesian evidence comparison of the retrievals \(\textrm{R}_0\), \(\textrm{R}_1\), and \(\textrm{R}_2\), measured in ADI. The horizontal axis plots the input abundances of NH\(_3\); the vertical axis reports the ADI difference between \(\textrm{R}_0\) and the other two retrievals, \(\textrm{R}_1\) and \(\textrm{R}_2\). The y-axis uses a matplotlib “symlog” scale with the linear threshold set at 1 for better visualization. The text box on the bottom shows the median ADI reported by each retrieval

4.2 Priors

In this section, we discuss the impact of the log-uniform priors adopted in the analysis on the results presented. The consequence is a non-Gaussian posterior distribution, and the mean, mode, and median are not equivalent moments of the distribution. In particular, the median is not an unbiased estimator of the true abundance as shown in Fig. 8 for low log-abundance S/N (hereafter, “abundance S/N”). This can be explained in terms of the Bayesian formulation of the posterior, \(\mathcal {P}\), which is proportional to the product of the likelihood, \(\mathcal {L}\), and the prior, \(\Pi \).

$$\begin{aligned} \mathcal {P} \propto \mathcal {L} \times \Pi \end{aligned}$$
(4)

Because \(\Pi (\log x)\) is uniform, \(\Pi (x) \sim 1/x\), for large abundance S/N, the likelihood dominates, the posterior is Gaussian (because of the central limit theorem), and the median estimator is unbiased. For low abundances, the prior dominates, \(\mathcal {P}(x) \propto 1/x\), and the median is an estimator of the molecular abundance that is biased towards low abundances. This is shown in Fig. 12. Each panel shows the probability density function (PDF) of the likelihood, prior and posterior normalized to 1 at the peak, for three cases where the abundance S/N is 4.0, 5.5, and 7.0, respectively, from the top to the bottom panel, assuming an input abundance of \(10^{-5}\). The posterior is likelihood-dominated when the abundance S/N is 7 and is prior-dominated when the abundance S/N is 4.

Fig. 12
figure 12

The probability density functions (PDF) of the likelihood, prior and posterior are shown by the red, blue, and black lines, respectively. The PDFs are normalized to 1 at their peak. The assumed abundance S/N is 4.0, 5.5, and 7.0, respectively, from the top to the bottom panel. An input abundance of \(10^{-5}\) is assumed

Although logarithmic uniform priors are often assumed in spectral retrieval studies, they are certainly not “uninformative priors” [73, 74]. Clearly, using these priors biases the median estimator of the molecular abundance in the low S/N regime, explaining the trends seen in Fig. 8. As a side note, log-priors on molecular abundances could as well introduce biases on the derived elemental abundances, therefore the issue has to be investigated carefully in future studies.

The low abundance S/N targets are those that contribute to the leftmost peak in the bimodal distribution of the P-statistic (Fig. 4). Further investigation is however needed to fully understand the origin of the P-statistic bimodality and its under-forecasting properties.

5 Conclusion

The Ariel Tier 1 is a shallow reconnaissance survey of a large and diverse sample of approximately 1000 exoplanet atmospheres. It is designed to achieve a signal-to-noise ratio (S/N) greater than 7 when the target exoplanet atmospheric spectra are binned into 7 photometric bands. Tier 1 enables rapid and broad characterization of planets to prioritize re-observations in higher Tiers for detailed chemical and physical characterization. However, Tier 1 may not have sufficient S/N at the spectral resolution required for high-confidence abundance retrieval of chemical species. Nonetheless, it contains a wealth of spectral information that can be extracted to address questions requiring population studies.

In this study, we have introduced a P-statistic, which is a function of the data that is sensitive enough to reveal the presence of molecules from transit spectroscopy observations of exoplanet atmospheres and can be used as a binary classifier. The P-statistic is estimated from the marginalized retrieval posterior distribution and provides an estimate of the probability that a molecule is present with an abundance exceeding a threshold, fixed at \(\mathbb {T}_{Ab} \sim 10^{-5}\) in this study, but can be optimized in future analyses.

We have tested the performance of the P-statistic on a simulated population of gaseous exoplanets, POP-Is, with traces of H\(_2\)O, CH\(_4\), and CO\(_2\) of randomized abundances, in a H\(_2\)-He dominated atmosphere. NH\(_3\) is also included as a disturbance parameter to test the robustness of the P-statistic. For this, three models are used in the retrievals: R\(_0\), which is representative of the data; R\(_1\), which is under-representative as it excludes NH\(_3\); and R\(_2\), which is over-representative as it includes additional molecules not considered in the simulated POP-Is.

We find that the P-statistic estimated from R\(_0\) posteriors shows a clear, above-noise correlation with the input abundances, allowing us to infer the presence of molecules. The P-statistic appears to follow a bimodal distribution, where targets with low abundance S/N are likely contributors to the peak at low P values. This is supported by the distribution of the median of the abundance posterior, which is an unbiased estimator of the true value only when the abundance S/N is sufficiently large (typically above 5). The P-statistic is affected by an under-forecasting bias, but this is not expected to adversely affect the classification of the planets in the survey as it can be calibrated in principle. This is further evidenced by ROC curves with large AUC, indicating that the P-statistic can be used to implement a reliable classifier for the presence of molecules. However, further investigation is needed to fully understand the origin of the P-statistic bimodality and its under-forecasting properties.

The results obtained appear not to be affected by the increase in complexity of the assumed atmospheric model, implemented in this study with the R\(_2\) retrieval model, as indicated by similar calibration and ROC curves. We find that the predictive power of the P-statistic is adversely affected by an under-representative model, as implemented in the R\(_1\) retrieval model, which is evident from a weaker correlation between the P-statistic and the input abundances, and the median of the posterior abundance no longer being a reliable unbiased estimator of the true value, even in the high abundance S/N regime.

Based on our findings, we conclude that the P-statistic is a reliable predictor of the presence of molecules within the parameter space explored, as long as the retrieval model matches the complexity of the data. Models that are under-representative can result in poor predictive power, while the investigated over-representative model does not seem to adversely affect classification. Further investigations are needed to test the robustness of the P-statistic over a wider parameter space, particularly including a wider set of molecules in both the simulated population and retrievals.

Software

ArielRad [46], TauREx 3 [45], Alfnoor [42, 44], Astropy [75], h5py [76], Matplotlib [77], Numpy [78].