Data analysis Pipeline for EChO end-to-end simulations

Atmospheric spectroscopy of extrasolar planets is an intricate business. Atmospheric signatures typically require a photometric precision of $1 \times 10^{-4}$ in flux over several hours. Such precision demands high instrument stability as well as an understanding of stellar variability and an optimal data reduction and removal of systematic noise. In the context of the EChO mission concept, we here discuss the data reduction and analysis pipeline developed for the EChO end-to-end simulator EChOSim. We present and discuss the step by step procedures required in order to obtain the final exoplanetary spectrum from the EChOSim`raw data' using a simulated observation of the secondary eclipse of the hot-Neptune 55 Cnc e.


Introduction
The field of extrasolar planets is innovative as it is new. Recent successes in characterisation of extrasolar planets are also always tales of characterising the instrument response function to an unprecedented detail. Always being at the edge of technical feasibility means that instrument calibration, observing strategy as well as data analysis and modelling are interdependent. In the light of the EChO ESA-M3 mission concept [1], such interdependence becomes important in the study of engineering decisions and instrument trade-offs. In other words, one needs to simulate the full observational and data analysis chain in order to gauge the impact the instrument concept has on the achievable error bar of the detection. Such a feat requires an advanced mission end-to-end simulator as well as an advanced data analysis pipeline. In this paper, we discuss the data analysis pipeline which is used in conjunction to the mission simulator, EChOSim [2]. The EChOSim data pipeline (from here on EChOSim-DP) is a stand-alone software custom built for EChOSim but with easy adaptability to other instruments and data sets in mind.
The method by which the EChO mission will characterise the nature of extrasolar planets is by time resolved spectroscopy of their atmospheres, in particular of transiting extrasolar planets. Briefly, when an exoplanet transits in front of its host star (in our line of sight) we observe a diminishing of the stellar flux due to the obscuration of the planet. The depth of the resulting lightcurve allows us to estimate the planetary radius (given the stellar radius is known). This we refer to as transit (or primary eclipse) observation. Should the exoplanet feature an extended atmosphere, we expect some of the stellar light to filter through the terminator region of the planetary atmosphere. Here we are sensitive to molecules absorbing the stellar light at specific wavelengths. We hence perceive a variation of transit depths depending on the wavelength range observed. These variations constitute the signatures of an exoplanetary absorption spectrum. Similarly, we can observe the occultation (or secondary eclipse) where the thermal contribution of the exoplanet's day-side is lost to the observer as the planet passes behind its host star. The study of transmission and emission spectroscopy is now a well established field for both space and ground based observations of exoplanetary atmospheres (e.g. [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23] also see [24] for a comprehensive review).

EChOSim
EChOSim is the EChO mission end-to-end simulator. EChOSim implements a detailed simulation of the major observational and instrumental effects, and associated systematics. It also allows the influence of individual instrumental and astrophysical parameters to be studied and thus represents a key tool in the optimisation of the instrument design. Observation and calibration strategies, data reduction pipelines and analysis tools can all be designed effectively using the realistic outputs produced by EChOSim [2,25]. The simulation output closely mimics standard STSci 1 FITS files, allowing for a high degree of compatibility with standard astronomical data reduction routines.

Examples
We illustrate individual steps in EChOSim-DP using diagrams. Unless specified otherwise, we follow a single data processing run of EChOSim simulated data of the hot-Neptune 55 Cnc e. EChOSim was run to simulate the Chemical Census mode of EChO, in which we co-add (in the case of 55 Cnc e) five eclipse observations to obtain a minimal signal-to-noise (S/N) of the final spectrum of S/N ∼ 5.
For this we assume spectra reconstructed with a resolving powers of 50,50,30,30,30 for the VNIR, SWIR, MWIR-1, MWIR-2, and LWIR channels. With each channel having a larger native resolving power, this allows us to increase the SNR of the detection for this particular observing mode. See [1] for a review of the proposed EChO observing modes.

Data Reduction
The EChOSim-DP is a stand-alone package delivered with the EChOSim code but can easily be adapted to observations produced by any spectrograph. It is written in fully object orientated Python allowing for a cross platform compatibility and an easy adaptability through its modular design. EChOSim-DP is subdivided into five main modules: 1) The data and parameter read-in and object initialisation, 2) data reduction, going from two dimensional focal plane illuminations to 1D time series data, 3) time series de-trending using non-parameteric de-trending algorithms, 4) lightcurve fitting using simplexdownhill minimisations as well as Markov Chain Monte Carlo (MCMC) techniques, 5) collection of results and computation of the final spectrum. We summarise this flow in figure 1.

Configuration and data formats
The output of EChOSim follows the standard FITS file conventions with the aim to make the raw data generated by EChOSim as universally readable as possible. The payload of EChO is subdivided into individual channels defined as: VNIR (0.4 -2.5µm), SWIR (2.5 -5.0µm), MWIR-1 (5.0 -8.5µm), MWIR-2 (8.5 -11.0µm) and LWIR (11.0 -16.0µm). For a detailed description of the individual channels we refer the reader to [1] and publications in this special issue. Due to varying detector array sizes, it is not possible to combine all focal plane read-outs (for an individual frame) in one conventional FITS data-cube. EChOSim hence utilises extensions to the Primary FITS Header Data Unit (PrimaryHDU). This allows the inclusion of meta data on each detector as well as additional auxiliary information carried in binary tables (BinaryHDUs). EChOSim produces one FITS file per integration interval resulting in 10s to 100s of files per simulated observation run. Whilst the high number of output files produced seems cumbersome, it reflects the data handling strategies of  current space and ground based instruments. EChOSim-DP is designed to be fully compatible to this customised FITS convention using a custom build read-in routine based on the PyFITS 2 package. EChOSim-DP can also natively read single HDU FITS files generated by other instruments. Auxiliary information contained in BinaryHDUs contains: the EChOSim generated stellar limb-darkening grid, EChOSim generated noiseless stellar, zodi and thermal fluxes from the instrument and its optical elements, EChOSim generated exoplanetary eclipse/transit depths, EChOSim generated Keplarian solutions. If specified by the user, EChOSim-DP can use these auxiliary data to calculate exact time series normalisation constants and eclipse/transit models to estimate best-case scenarios. EChOSim-DP specific parameters are specified in a separate ascii file and parsed using the python specific ConfigParser 3 .

Focal Plane Binning
Given current detector design specifications, the native spectral resolution (R = λ/∆λ) of EChO can exceed that required by the science case. EChOSim-DP provides two available spectral binning formats: 1) constant R, equation 1; 2) constant ∆λ, equation 2: where ∆x is the binning interval along the spectral axis in pixels, λ and λ mid the wavelength and central wavelength in µm and ∆ pix the pixel size in µm.
Note that EChO spectrometers sample each spectral resolving element with two detector pixels. Figure 3 shows ∆x for both binning methods as function of λ. Binning is performed directly on the focal plane before spectral extraction. This increases S/N and avoids potential biasing of the data.

Optimal extraction
After the data has been binned, we extract the raw spectrum along the spatial axis for each individual time stamp. At each integration time, the raw spectrum is extracted from the data by fitting a model of the PSF to the point-like dispersed signal of the star + planet flux. Two extraction options are available: 1) Unconstraint PSF, 2) EChOSim PSF with Fine Guidance Sensor (FGS) offset data.
Option 1: is the least constraint extraction. Depending on user input, EChOSim-DP fits a Gaussian or Generalised Gaussian Distribution (GGD) PSF along the spatial axis. The GGD is given by where µ y is the mean position of the spectrum along the spatial axis y for all frames, ∆y(t) is a time dependent offset from the mean, α is a scale parameter and in this case equivalent to α = 2σ y and σ y signifies the width of the PSF. The shape parameter β introduces a kurtosis argument in the Gaussian distribution. We retrieve the Normal PSF by setting β = 2 and obtain leptokurtic and platiokurtic distributions for β < 2 and β > 2 respectively. We do not assume skew of the PSF in the spatial direction. The PSF shape can either be left as free parameter (to be fitted from the data) or specified as user input. Equation 3 is convolved with the detector response function assumed by EChOSim to obtain the extraction profile.
where ⊗ is the convolution operator and the detector response [26] is given by where ∆ pix is the pixel size in µm and l y the diffusion length in µm. Option 2: Here we assume a Gaussian PSF (by setting β = 2) with a fixed width given by σ y = F #K y λ where F # is the effective focal length of the telescope in µm, K y is the PSF aberration parameter and λ the wavelength in µm. We hold µ y fixed at an EChOSim specified value and obtain the time dependent offset ∆y(t) from the EChOSim provided fine guidance sensor (FGS) centroiding.
The centroiding is provided as part of the auxiliary information BinaryH-DUs and consists of a time series of y-positional offsets sampled at 1Hz frequency. EChOSim-DP downsamples the positional offsets to the integration times specified in the FITS headers. The downsampling operation correctly reflects the error in the positional offset ∆y(t) and the associated flux error.
EChOSim-DP calculates the background by computing the median (or mean given user input) focal plane illumination 4σ y away from µ y . The background flux is integrated over the area (in pixels) of the extraction profile and subtracted form the extracted flux.

Normalisation
The final step is the normalisation of the data to the out of transit (OOT) baseline. Similarly to section 2.4 the normalisation can either be estimated from the data itself by calculating the OOT mean or normalised using noiseless stellar fluxes provided by EChOSim

Data de-trending
After the data as been reduced to 1D time series, EChOSim-DP can attempt a de-correlation of wavelength correlated non-Gaussian systematics. These systematics tend to be due to array wide fluctuations of quantum efficiencies, insufficient flat-fielding, slit-loss effects and pointing jitter. These complex non-Gaussian signals have shown to be important effects in real instruments [14,27,28,29]. EChOSimimplements inter and intra-pixel variations and non-Gaussian pointing jitter noise. Other non-linear noise sources such as correlated astrophysical noise (e.g. such as stellar pulsation, stellar spots and faculae noise) will be included in future releases of EChOSim.
Here we implement the ACICA de-trending algorithm [29]. Based on blinddeconvolution using Independent Component Analysis [27,30,31], we estimate the common non-Gaussian time and wavelength correlated signals and construct a systematic noise model which is then used to correct each individual time series. The advantage of these types of de-trending algorithms over others such as Gaussian Processes [32] are their non-parametric nature. This guarantees a high degree of objectivity in the de-trending as well as a simple implementation into existing code (due to the lack of parameterisation required).

Lightcurve modelling
Once the data is reduced and de-correlated, the pipeline provides several means of model fitting the resulting lightcurves. The modelling is divided into two main modes: Radiomentric and Dynamic. In the simplest model assumption, the radiometric case, we simply calculate the error bar from the out-of-transit (OOT) scatter of the time series and estimate the transit depth by taking the ratio of in-transit (IT) and OOT data. For the Dynamic case we use a full transiting planet model [33] and iteratively fit for the transit depth parameter using a simplex-downhill algorithm as well as a Markov Chain Monte Carlo (MCMC) routine.

Radiometric data analysis
For most cases, and for the sake of computational efficiency, the simplistic radiometric model results are desired for EChOSim observations. Let us assume a secondary eclipse measurement of an exoplanet. In the simplest radiometric case, we calculate the transit depth via the simple relation where δ is the transit depth, F out is the baseline flux (blue line in figure 5) and is defined as where t is the time index, N the number of observations in time, t 0−1 defines pre-ingress baseline time and t 4−5 post-egress timeline (see figure 5). Similarly we define the in-transit flux as where N is the number of observations and we assume that N out = N in = 2N as well as σ out = σ in .

Interpretation of radiometric model
The assumption σ out = σ in seems straight forward as one expects the photometric stability not to vary significantly between out-of-eclipse and in-eclipse times. The radiometric error as in equation 10 is the correct error treatment for the observation of a single lightcurve at a single wavelength with equal lengths of out-of-transit and in-transit measurements. It assumes that no additional knowledge of the baseline (out-of-transit) flux is available and describes the state of largest ignorance, i.e. σ → √ 2σ. Should additional knowledge of the baseline flux be available (via the calibration of the wavelength dependent stellar spectrum), we can reduce the normalisation error on the baseline. Hence for a perfect knowledge of the baseline flux level σ total → σ.

Dynamic data analysis
Going beyond the radiometric model assumptions, EChOSim-DP has two additional time-resolved lightcurve model modes: 1) Simplex and 2) MCMC.
In the simplex case, we fit an analytical lightcurve model [33] to each individual lightcurve in wavelength space, λ. It fully supports eccentric orbit calculations following [34] and allows all model parameters to be fitted. For lightcurves in wavelength ranges below 5µm we assume stellar limb-darkening for primary eclipses. Here we linearly interpolate the quadratic limb-darkening coefficients of [35] or read the limb-darkening coefficient grid provided by EChOSim to provide an exact match. For the model minimisation we use a simplex-downhill algorithm [36,37]. In this simple minimisation scheme, we obtain the error bar on the model fit using equation 10. Each modelling run creates a new model-fitting object in the data pipeline which allows multiple model runs (radiometric as well as dynamic) to be executed in the same instance of the EChOSim-DP.
We furthermore include a more computationally intensive Markov Chain Monte Carlo routine in EChOSim-DP. This routine allows us to investigate W a v e l e n g t h Flux T i m e where L(θ) is the model likelihood and π(θ) the prior distribution on the parameter θ. The likelihood is here assumed to be Gaussian and is given by where d is the data column vector, and d t and Φ t (θ) are the datum and lightcurve model at given timestamp t.
We use the PyMC 4 package implementing the adaptive Metropolis Hastings algorithm of [38]. The MCMC chains are typically run with 20,000 iterations taking the minimised result of the simplex-downhill algorithm as starting value to minimise burn-in time [39] which we restrict to 1000 iterations. We  [33] with the eclipse depth δ as only free parameter. Note the lack of stellar limb-darkening in secondary eclipses and hence a very discrete ingress and egress. here present the univariate version of the likelihood as in most cases all transit parameters but the depth, δ, are fixed. To minimise parameter covariances for multiple free parameters one can follow parameterisation by [40] or [41]. Using a bayesian approach, we can investigate more complex model solutions such as the impact of the stellar variability on the normalisation of individual lightcurves. Figure 6 illustrates a time series observation of a transiting exoplanet over a wide range of wavelengths. Here the blue curves represent the stellar spectrum, the black curves the time dependent flux variation due to the transiting extrasolar planet with the green line marking the minimum flux. As discussed in section 4.1.1, if all time series measurements are assumed to be independent of each other (i.e. not correlated in wavelength), we must assume an error of √ 2σ on the measurement, given the uncertainty of the OOT normalisation. However, it is clear from figure 6 that OOT flux of individual time series is correlated in λ through the stellar spectrum. For a perfect correlation (i.e. absolute knowledge on the correct normalisation of the individual time series) the measurement error hence reduces to σ. Hence the normalisation error, σ norm , is bound by 0 ≤ σ norm ≤ √ 2. We can now express the likelihood of our observation, L, as product of the likelihood of the lightcurve model, L(θ) and the stellar spectrum model L(ϕ). Note that by taking the product we implicitly assume statistical independence between lightcurve and stellar spectra models and below we explicitly assume a Gaussian noise model where χ 2 is the chi-squared distribution. We can now write the log-likelihood as follows where Φ(θ t ) is the lightcurve model for given time index t, Ψ (θ λ ) is the stellar model for given wavelength index λ, M is the number of resolution elements in the spectrum and σ t and σ λ are the flux uncertainties on the time series and the stellar spectrum respectively. Note that these error terms are not equivalent and also note thatF t=t2−3,λ is the averaged stellar spectrum from time interval t 2 − t 3 .

Outputs
Two types of outputs are provided: spectra in ascii format and python-pickel 5 objects. For each individual lightcurve fitting, EChOSim-DP provides an ascii file containing wavelength, measured flux and error. The pickle file contains all parameters, intermediate and final data products allowing for an exact reproducibly of results. Figure 9 shows the final spectrum for 55 Cnc e in the Chemical Census mode (blue error bars). Figures 10 and 11 show the same simulation for the Origins and Rosetta stone observing modes of EChO.  Fig. 9: Final spectrum generated from EChOSim-DP outputs for 55 Cnc e secondary eclipse run in Chemical census mode (i.e. 5 eclipses stacked, R = 50 for λ < 5µm and R = 30 for λ > 5µm). Blue: error bars derived from EChOSim-DP. Grey: planetary emission spectrum read into EChOSim. We marked prominent emission/absorption features.  Fig. 10: Final spectrum generated from EChOSim-DP outputs for 55 Cnc e secondary eclipse run in Origin mode (i.e. 17 eclipses stacked, R = 100 for λ < 5µm and R = 30 for λ > 5µm). Blue: error bars derived from EChOSim-DP. Grey: planetary emission spectrum read into EChOSim. We marked prominent emission/absorption features.

Discussion & Conclusion
EChOSim-DP is a custom built data reduction and analysis pipeline for the EChOSim end-to-end mission simulator of the EChO mission concept.
Despite its customised nature, we have developed the pipeline with easy adaptability (through its fully object-orientated programming ) to other in-  Fig. 11: Final spectrum generated from EChOSim-DP outputs for 55 Cnc e secondary eclipse run in Rossetta mode (i.e. 65 eclipses stacked, R = 300 for λ < 5µm and R = 30 for λ > 5µm). Blue: error bars derived from EChOSim-DP. Grey: planetary emission spectrum read into EChOSim. Inset is a zoom into the 2.2 -2.5 µm wavelength region. We marked prominent emission/absorption features.
struments and data-sets in mind. The pipeline features state of the art data de-correlation algorithms as well as a full Bayesian analysis implementation via adaptive MCMC. Both these aspects, the de-trending as well as the exploration of stellar variability are not required for the current version of EChOSim (version 3.x) but included with future releases. These releases will have special emphasis on realistic stellar noise simulations [42] as well as more advanced non-Gaussian instrument systematics.