1 Introduction

The standard model of Cosmology consists of a flat, homogeneous and isotropic universe whose energy content is dominated by a cosmological constant (\(\Lambda \)) and cold dark matter (\(\Lambda \)CDM) [1, 2]. Such a model provides the best description of cosmological observations such as the temperature fluctuations of the Cosmic Microwave Background (CMB) [3], luminosity distances to Type Ia Supernovae (SNe) [4], the large-scale clustering of galaxies (LSS), and weak gravitational lensing (WL) [5,6,7,8]. Despite its tremendous success, this model presents theoretical caveats, such as the value of the vacuum energy density [9, 10], in addition to observational challenges, e.g. the \(\simeq 5\sigma \) Hubble constant tension between CMB and SNe observations [11,12,13], as well as milder tensions between matter density perturbation estimates from the CMB and LSS, and a CMB lensing amplitude slightly enhanced with respect to the \(\Lambda \)CDM prediction. These conflicting measurements may hint at physics beyond the standard cosmology.

Given the necessity to probe the Universe at larger and deeper scales, cosmological surveys such as the Javalambre-Physics of the Accelerated Universe Astrophysical Survey (J-PAS) [14, 15], the Dark Energy Spectroscopic Instrument (DESI) [16], Euclid [17], the Square Kilometre Array (SKA) [18] and the Large Synoptic Survey Telescope (LSST) [19] were proposed and developed. They will improve the constraints we currently have on the parameters of the \(\Lambda \)CDM model, and probe departures from it with unprecedented sensitivity. In order to extract the most cosmological information from the tantalising amount of data to come, the deployment of machine learning (ML) algorithms in Physics and Astronomy [20, 21] is becoming crucial to accelerate data processing and improve statistical inference. Recent applications of ML in Cosmology focus on reconstructing the late-time cosmic expansion history to test fundamental hypotheses of the standard model and constrain its parameters [22,23,24,25,26,27,28,29,30,31,32,33,34,35], cosmological model discrimination with LSS and WL [36,37,38,39,40,41,42,43,44,45,46], predicting structure formation [47,48,49,50,51,52,53,54,55,56,57], probing the era of reionisation [58,59,60,61,62,63,64], and photometric redshift estimation [65,66,67,68,69,70,71,72], besides the classification of astrophysical sources [73,74,75,76] and transient objects [77,78,79,80]. These analyses reveal that ML algorithms are able to recover the underlying cosmology from data and simulations with greater precision than traditionally used techniques, e.g. the 2-point correlation function and power spectrum, in addition to Markov Chain Monte Carlo (MCMC) methods.

In this paper we discuss the ability to measure the Hubble Constant \(H_0\) from cosmic chronometer measurements of the Hubble parameter, H(z), using different ML algorithms. We first produce synthetic H(z) data-sets with different numbers of data points and measurement uncertainties, in order to benchmark the \(H_0\) constraints of each algorithm given the quality of the input data. Rather than performing a numerical reconstruction across the redshift range probed by the data and then fitting \(H_0\), we extrapolate the reconstructed H(z) values down to \(z=0\). We also compare their performance with other non-parametric reconstruction methods, such as the popularly adopted Gaussian Processes (GAP) [81]. Our goal is to verify whether the ML algorithms can provide a competitive cross-check of the GAP results.

The paper is structured as follows: Sect. 2 is dedicated to the cosmological framework and the simulations produced for our analysis. Section 3 explains how the analysis is performed, along with the metrics adopted to evaluate the performance of the algorithms. Section 4 presents our results; finally, our main conclusions and remarks are presented in Sect. 5.

2 Simulations

2.1 Prescription

In order to compare how different prediction algorithms perform with data of different quality, we produce simulated H(z) data sets according to the following prescription:

  (i)

    We assume as fiducial cosmology the flat \(\Lambda \)CDM model given by Planck 2018 (TT, TE, EE+lowE+lensing; hereafter P18) [3]:

    $$\begin{aligned} H^{\textrm{fid}}_0 = 67.36 \pm 0.54 \, \mathrm{km \, s^{-1} \, Mpc^{-1}} , \end{aligned}$$
    (1)
    $$\begin{aligned} \Omega ^{\textrm{fid}}_{\textrm{m}} = 0.3166 \pm 0.0084 , \end{aligned}$$
    (2)
    $$\begin{aligned} \Omega ^{\textrm{fid}}_{\Lambda } = 1-\Omega ^{\textrm{fid}}_{\textrm{m}} , \end{aligned}$$
    (3)

    so that the Hubble parameter follows the Friedmann equation for the fiducial \(\Lambda \)CDM model

    $$\begin{aligned} \left[ \frac{H^{\textrm{fid}}(z)}{H^{\textrm{fid}}_0}\right] ^2 = \Omega ^{\textrm{fid}}_{\textrm{m}}(1+z)^3 + \Omega ^{\textrm{fid}}_{\Lambda } . \end{aligned}$$
    (4)
  (ii)

    We compute the values of H(z) at \(N_z\) data points whose redshifts follow the distribution p(z) adopted in [27],

    $$\begin{aligned} p(z; \; k,\theta ) = z^{k-1}\frac{e^{-z/\theta }}{\theta ^{k}\Gamma (k)} \;, \end{aligned}$$
    (5)

    where we fix \(\theta \) and k to their respective best fits to the real cosmic chronometer data, as in [27], i.e., \(\theta _{\textrm{bf}}=0.647\) and \(k=1.048\).

  (iii)

    In order to understand how our knowledge of H(z) across redshift affects the performance of the statistical learning, we provide different sets of H(z) assuming different numbers of points, \(N_z\) = 20, 30, 50 and 80, and different relative uncertainties, i.e., \(\sigma _H/H=0.008, 0.01, 0.03, 0.05, 0.08\). This variation of \(N_z\) and \(\sigma _H/H\) allows us to evaluate what level of accuracy in the H(z) measurements is necessary in order to obtain a given precision on the predicted \(H_0\).

  (iv)

    We also produce H(z) simulations based on the current cosmic chronometer data, which consist of \(N_z=31\) measurements presented in Table 1 – see also Table I in [27].

Such a prescription provides a benchmark to test the performance of the ML algorithms deployed.
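
For concreteness, the prescription above can be condensed into a short script. The sketch below is our own illustration (function and variable names such as simulate_hz are not taken from the original analysis code); it produces a single noiseless realisation centred on the fiducial model, with the scatter added later by the MC-bootstrap procedure of Sect. 2.2.

```python
import numpy as np

# Fiducial flat LambdaCDM parameters (Planck 2018, Eqs. 1-3)
H0_FID, OM_FID = 67.36, 0.3166

def hubble_fid(z):
    """Fiducial H(z) from the Friedmann equation, Eq. (4)."""
    return H0_FID * np.sqrt(OM_FID * (1.0 + z)**3 + (1.0 - OM_FID))

def simulate_hz(n_points, rel_err, theta=0.647, k=1.048, seed=42):
    """Draw N_z redshifts from the gamma distribution of Eq. (5) and attach
    the fiducial H(z) values with relative uncertainties sigma_H/H = rel_err."""
    rng = np.random.default_rng(seed)
    z = np.sort(rng.gamma(shape=k, scale=theta, size=n_points))
    hz = hubble_fid(z)
    return z, hz, rel_err * hz

# Example: one of the data-set specifications of item (iii)
z, hz, sig = simulate_hz(n_points=30, rel_err=0.03)
```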

Table 1 The 31 cosmic chronometer (CC) H(z) measurements, obtained from the differential age method, used in our analysis

2.2 Uncertainty estimation

Although these algorithms are able to provide a point prediction of H(z) at any given redshift, and hence of \(H_0\) at \(z=0\), they do not provide the associated uncertainty. We develop a Monte Carlo-bootstrap (MC-bootstrap) method for this purpose, described as follows (a minimal code sketch is provided after the list):

  • Rather than creating a single simulation centred on the fiducial model for each data-set (item (i) of Sect. 2), we produce each H(z) measurement at a given redshift by drawing from a normal distribution centred on its fiducial value, \({\mathcal {N}}(H^{\textrm{fid}}(z),\sigma _{H})\). Here \(H^{\textrm{fid}}(z)\) is the H(z) value given by the fiducial Cosmology, whereas \(\sigma _{H}\) is the uncertainty obtained from the relative errors \(\sigma _H/H\) described in item (iii) of Sect. 2.

  • As for the “real data” simulations, described in item (iv) of Sect. 2, we replace the ith H(z) measurement presented in the second column of Table 1 by a value drawn from a normal distribution centred on the fiducial model, i.e., \({\mathcal {N}}(H^{\textrm{fid}}(z_i),\sigma _{H;i})\), where \(z_i\) is the redshift of each data point and \(\sigma _{H;i}\) its corresponding uncertainty – first and third columns in Table 1, respectively.

  • We repeat this procedure 100 times for each data-set of \(N_z\) data points with \(\sigma _H/H\) uncertainties, as described in items (iii) and (iv) of Sect. 2.

  • The 100 MC realisations produced for each case are provided as inputs to each of the ML algorithms described in Sect. 3.1.

  • We report the average and standard deviation of these 100 values as the \(H_0\) measurement and its uncertainty, respectively, for each \(N_z\) and \(\sigma _H/H\) case. The same applies to the “real data” simulations.
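
A minimal sketch of this loop is given below, where predict_h0 is a placeholder for any of the trained regression pipelines of Sect. 3.1 rather than a function of the original analysis.

```python
import numpy as np

def mc_bootstrap_h0(z, hz_fid, sigma_h, predict_h0, n_real=100, seed=0):
    """Scatter the fiducial H(z) values within their uncertainties, rerun the
    regression on each realisation and return the mean and standard deviation
    of the resulting H_0 predictions."""
    rng = np.random.default_rng(seed)
    h0_samples = []
    for _ in range(n_real):
        hz_mc = rng.normal(loc=hz_fid, scale=sigma_h)  # one MC realisation
        h0_samples.append(predict_h0(z, hz_mc))        # extrapolation down to z = 0
    h0_samples = np.asarray(h0_samples)
    return h0_samples.mean(), h0_samples.std()
```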

3 Analysis

3.1 Methods

Our regression analyses are carried out on all simulated and “real” data-sets with several ML algorithms available in the scikit-learn packageFootnote 1 [90]. First, we divide our input sample into training and testing data-sets as

figure a

so that our testing sub-set contains 25% of the original sample size. Then we deploy the different ML algorithms on the training set, looking for the “best combination” of hyperparameters with the help of GridSearchCV.Footnote 2 Given a ML method, this scikit-learn function performs the learning with every combination of hyperparameters on a grid and reports the performance of each one during the cross-validation (CV) procedure.Footnote 3 Such a procedure is adopted in order to avoid overfitting on the test set. We chose CV \(=3\) in our analysis, given the limited number of H(z) data points.
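
A plausible sketch of this step, using standard scikit-learn calls and continuing from the simulation sketch of Sect. 2.1 (which defines the arrays z and hz), reads as follows; the Extra-Trees regressor and the grid shown here are illustrative choices of our own.

```python
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# 75/25 split between training and testing sub-sets (z is the feature, H(z) the target)
X_train, X_test, y_train, y_test = train_test_split(
    z.reshape(-1, 1), hz, test_size=0.25, random_state=0)

# Illustrative hyperparameter grid; the grids actually adopted are listed in Sect. 3.1
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}
search = GridSearchCV(ExtraTreesRegressor(random_state=0), param_grid, cv=3)  # CV = 3
search.fit(X_train, y_train)

h0_pred = search.best_estimator_.predict([[0.0]])[0]  # extrapolation down to z = 0
```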

The ML methods deployed in our analysis are given as follows:

  • Extra-Trees (EXT): An ensemble of randomised decision trees (extra-trees). The goal of the algorithm is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximationFootnote 4 [91]. We evaluate the hyperparameter values that best fit the input simulations through a grid search. Hence, our grid search over the EXT hyperparameters is given by:

    figure b
  • Artificial Neural Network (ANN): A Multi-layered Perceptron algorithm that trains using backpropagation with no activation function in the output layerFootnote 5 [92]. The ANN hyperparameter grid search consists of:

    figure c
  • Gradient Boosting Regression (GBR): This estimator builds an additive model in a forward stage-wise fashion; it allows for the optimisation of arbitrary differentiable loss functions. In each stage a regression tree is fit on the negative gradient of the given loss functionFootnote 6 [93]. The grid search over the GBR hyperparameters corresponds to:

    figure d
  • Support Vector Machines (SVM): A linear model that constructs a line or hyperplane to separate data into different classes [94, 95]. Originally developed for classification problems, it was later extended to regression, which is how it is employed in this work.Footnote 7 The hyperparameter grid search of the SVM method reads

    figure e

Note that we adopted the default evaluation metric for each ML algorithm as defined by the scikit-learn package. Thus the EXT method uses the squared error metric to define the quality of a tree split, and likewise for the GBR and ANN loss functions, whereas the SVM method assumes \(\epsilon =0.1\), so that only samples whose prediction is at least \(\epsilon \) away from their true target are penalised.Footnote 8
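
As a concrete illustration of these defaults, the four regressors could be instantiated as in the sketch below; the polynomial SVM kernel and the “relu” ANN activation anticipate the choices mentioned in Sect. 4, while everything else simply spells out the scikit-learn defaults described in the text.

```python
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

estimators = {
    # squared error measures the quality of each tree split (default criterion)
    "EXT": ExtraTreesRegressor(criterion="squared_error"),
    # squared error loss, minimised stage-wise along its negative gradient (default)
    "GBR": GradientBoostingRegressor(loss="squared_error"),
    # backpropagation with "relu" hidden activation and identity output layer
    "ANN": MLPRegressor(activation="relu"),
    # epsilon-insensitive loss: only predictions more than 0.1 away from the target are penalised
    "SVM": SVR(kernel="poly", epsilon=0.1),
}
```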

In order to evaluate the performance of these methods, we report the training and test scores as

figure f
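
Assuming the default scikit-learn scorer for regressors (the coefficient of determination \(R^2\)) and continuing the grid-search sketch above, this step amounts to the following:

```python
# Training and test scores of the best grid-search estimator
# (the default .score() of scikit-learn regressors is the R^2 coefficient)
train_score = search.best_estimator_.score(X_train, y_train)
test_score = search.best_estimator_.score(X_test, y_test)
print(f"train score = {train_score:.3f}, test score = {test_score:.3f}")
```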

Moreover, we deploy the well-known Gaussian Process regression (GAP) algorithm on the same simulated data-sets using the GaPP package [81]. We compare the results obtained with the ML algorithms just described against GAP, since the latter has been widely used in the literature for similar purposes for about a decade. Two GAP kernels are assumed in our analysis, namely the Squared Exponential (SqExp) and the Matérn(5/2) (Mat52). We justify these choices on the basis that the SqExp kernel exhibits greater differentiability than the Mat52, which may result in a larger degree of smoothing in the reconstruction – and hence smaller reconstruction uncertainties – that may or may not fully represent the underlying data.
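
The GAP reconstructions in this work rely on the GaPP package; purely to illustrate the two kernels, the sketch below uses scikit-learn's GaussianProcessRegressor instead (our stand-in, not the code actually employed), with RBF playing the role of the Squared Exponential covariance and Matern(nu=2.5) that of the Matérn(5/2). It reuses the arrays z, hz and sig from the simulation sketch of Sect. 2.1.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, ConstantKernel

kernels = {
    "SqExp": ConstantKernel(1e4) * RBF(length_scale=1.0),
    "Mat52": ConstantKernel(1e4) * Matern(length_scale=1.0, nu=2.5),
}

for name, kernel in kernels.items():
    gp = GaussianProcessRegressor(kernel=kernel, alpha=sig**2,  # per-point noise variance
                                  normalize_y=True, n_restarts_optimizer=5)
    gp.fit(z.reshape(-1, 1), hz)
    h0, h0_err = gp.predict(np.array([[0.0]]), return_std=True)
    print(f"{name}: H0 = {h0[0]:.2f} +/- {h0_err[0]:.2f}")
```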

3.2 Robustness of results

We define the bias (b) as the average displacement of the predicted Hubble Constant (\(H^{\textrm{pred}}_0\)), obtained from the MC-bootstrap method, from the fiducial value, i.e., \(\Delta H_0=H^{\textrm{pred}}_0 - H^{\textrm{fid}}_0\), and the Mean Squared Error (MSE) as the average squared displacement:

$$\begin{aligned} \textrm{b} = \langle \Delta H_0 \rangle , ~~~ \textrm{MSE} = \langle \Delta H_0^2 \rangle . \end{aligned}$$
(6)

Using the definition of variance, we estimate the bias-variance tradeoff (BVT) of our analysis as

$$\begin{aligned} \textrm{BVT} = \langle \Delta H_0^2 \rangle - \langle \Delta H_0 \rangle ^2 = \textrm{MSE} - \textrm{b}^2 \;, \end{aligned}$$
(7)

which allows us to evaluate the performance of these algorithms for each simulated data-set specification.
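
As a minimal numerical illustration (with made-up numbers standing in for the 100 MC-bootstrap predictions), these quantities can be computed directly as:

```python
import numpy as np

H0_FID = 67.36
h0_pred = np.array([68.1, 66.9, 67.8, 67.2, 66.5])  # stand-in for the 100 MC-bootstrap values

delta = h0_pred - H0_FID           # displacements Delta H_0
bias = delta.mean()                # b   = <Delta H_0>,    Eq. (6)
mse = (delta**2).mean()            # MSE = <Delta H_0^2>,  Eq. (6)
bvt = mse - bias**2                # BVT = MSE - b^2,      Eq. (7)
```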

4 Results

Fig. 1
figure 1

\(H_0\) measurements from the algorithm EXT (top left), ANN (top right), GBR (center left), SVM (center right), SqExp (lower left) and Mat52 (lower right), plotted against the number of simulated H(z) measurements. Each data point represents different \(\sigma _H/H\) values, whereas the light blue horizontal lines denote the fiducial \(H_0\) value

Fig. 2
figure 2

Same as Fig. 1, but for the BVT. Each data point corresponds to different \(\sigma _H/H\) values

Fig. 3
figure 3

The reconstructed H(z) values obtained from EXT (top left), ANN (top right), GBR (center left), SVM (center right), SqExp (lower left) and Mat52 (lower right). Different shades of magenta represent different confidence levels for the reconstructions, ranging from \(1\sigma \) (darker shade) to \(3\sigma \) (lighter shade)

We show our \(H_0\) measurements for each algorithm in Fig. 1. The top panels present the results obtained from the EXT (left) and ANN (right) algorithms, the middle panels display the GBR (left) and SVM (right) results, and the bottom ones show the GAP predictions for the Mat52 and SqExp kernels in the left and right plots, respectively. Each data point in these plots represents the predicted \(H_0\) value (\(H^{\textrm{pred}}_0\)) according to the prescription described in Sect. 2.1 for each simulated data-set specification, i.e., different \(\sigma _{H}/H\) values plotted against \(N_z\). The light blue horizontal line corresponds to the fiducial \(H_0\). We can see that GBR and EXT are able to correctly predict the fiducial \(H_0\) for higher \(N_z\) and lower \(\sigma _{H}/H\) values, but not otherwise – especially for low values of \(N_z\), where these algorithms predict a larger \(H_0\) value. This indicates a bias in the results, despite the cross-validation procedure adopted. We also find that ANN and SVM are able to recover the fiducial \(H_0\) even for lower-quality sets of simulations; however, their predictions present larger variances as \(\sigma _{H}/H\) increases.

These results show that GBR and EXT are more sensitive to \(N_z\) in terms of bias, whereas ANN is more sensitive to \(\sigma _{H}/H\) with respect to its variance. On the other hand, SVM exhibits the best bias-variance tradeoff among all algorithms, as shown in Fig. 2, along with Tables 2, 3, 4, 5 and 6, presented in Appendix B. We find that SVM is able to recover the fiducial \(H_0\) without significant losses in bias and variance as the simulation quality decreases. Such a result may be explained by a few reasons, for instance the non-guaranteed convergence of neural networks: with an appropriate choice of hyperparameters, an ANN can approach a target function until a satisfactory result is reached, whereas SVMs are theoretically grounded in their capacity to converge to the solution of a problem. Note that we adopted a polynomial kernel for SVM and a nonlinear activation function for ANN, namely “relu”, in order to make a fair comparison between them. As for EXT, such an algorithm is known to be prone to overfitting and sensitive to outliers, which can explain its larger bias with lower variance; a similar problem applies to GBR as well. In addition, both demand a longer training time than ANN and SVM, which translates into a longer computational time to obtain the \(H_0\) measurements and uncertainties in our case.

Regarding the comparison with the results obtained with GAP using the Squared Exponential (SqExp) and Matérn(5/2) (Mat52) kernels, we find good agreement between the SVM and GAP measurements of \(H_0\), in spite of a slightly larger BVT for the former. Note, however, that GAP can also be prone to overfitting, as the data points with smaller uncertainties have a greater impact in determining the function that best represents the distance between data points in the numerical reconstruction – especially when assuming the SqExp kernel, which presents greater differentiability than the Mat52 case. Therefore, we show that SVM can be used as a cross-check method for GAP regression, which has been widely used in the literature.

Moreover, we show the H(z) reconstructions obtained from the simulations mimicking the real data configuration in Fig. 3, at the 1, 2 and \(3\sigma \) confidence levels, alongside the actual H(z) measurements. We can clearly see a “step-wise” behaviour in the EXT and GBR reconstructions, in contrast to the other algorithms, which illustrates the bias problems they face, as commented before. Once again, the GAP results exhibit the smallest uncertainties among all, but this may also be due to possible overfitting, as exemplified by the dip at the high-z end of the reconstruction. Nevertheless, the Hubble Constant values measured by all algorithms are in agreement with each other, as depicted in Table 7, where EXT and GBR again exhibit a tendency towards larger \(H_0\) values – and hence larger BVT – while ANN and SVM present less biased results, as in the previous cases. Interestingly, we find that ANN performed slightly better than SVM this time, yielding a slightly lower BVT. Note also that our results are in good agreement with the predicted \(H_0\) of [27], who also used an ANN in their analysis,Footnote 9 but we obtain a slightly lower uncertainty in our predictions – roughly 17% versus 23% in their case.

We also checked whether the test sample size affects the \(H_0\) predictions and their bias-variance tradeoff. We find that the default choice, i.e., a 75–25% split between training and test samples, respectively, provides the best results for all algorithms when compared to, for instance, 90–10% and 60–40% splits. The EXT and GBR algorithms perform similarly for the 10% and 25% cases, but their predictions become significantly worse for the 40% case. This is an expected result, since both algorithms require large training sets to carry out such predictions. On the other hand, ANN and SVM perform significantly worse for split choices other than the default one. Finally, we verified the results for different cross-validation values, such as CV \(=2,4,8\), finding values consistent with those obtained with the standard choice CV \(=3\).

5 Conclusions

Machine learning has been gaining formidable importance in recent times. Given the state of the art of modern computational and processing facilities, the application of machine learning algorithms in the physical sciences has become not only popular but essential in the process of handling huge data-sets and performing model prediction, especially in light of forthcoming redshift surveys.

Our work focused on a comparison of different machine learning algorithms for the purpose of measuring the Hubble Constant from cosmic chronometer measurements. We used four different algorithms in our analysis, based on decision trees, artificial neural networks, support vector machines and gradient boosting, as available in the scikit-learn python package. We applied them to simulated H(z) data-sets with different specifications, assuming a flat \(\Lambda \)CDM model consistent with the Planck 2018 best fit, in order to measure \(H_0\) through an extrapolation procedure.

Our uncertainties were estimated by applying a Monte Carlo-bootstrap method to the simulations, after properly splitting them into training and test sets and performing a grid search over the hyperparameter space during the cross-validation procedure. In addition, we created a performance ranking of these methods via the bias-variance tradeoff, and compared them with other methods established in the literature, e.g. Gaussian Processes as implemented in the GaPP code.

We found that the algorithms based on decision trees and gradient boosting present the lowest performance of all, as they provide low variance but a large bias in the reconstructed \(H_0\). In contrast, the artificial neural networks and support vector machine are able to correctly recover the fiducial \(H_0\) value, with the latter method exhibiting the lowest variance among them. We also found that the support vector machine algorithm presents benchmark metrics compatible with those of the Gaussian Processes. This result shows that such a method can be successfully used as a cross-check between different non-parametric reconstruction techniques, which will be of great importance with the advent of next-generation cosmological surveys [14,15,16,17,18,19], as they are expected to provide H(z) measurements with a few per cent precision.