1 Introduction

The (flat) \(\Lambda \)CDM cosmological model is an extremely successful minimal model that returns seemingly consistent cosmological parameters across Type Ia supernovae (SNe) [1, 2], Cosmic Microwave Background [3] and baryon acoustic oscillations [4]. Despite this success, comparison of early and late Universe cosmological parameters has revealed discrepancies [5,6,7,8,9,10,11]. The origin [12, 13] and resolution [14, 15] of these anomalies is a topic of debate. We observe that the \(\Lambda \)CDM model describes approximately 13 billion years of evolution of the Hubble parameter H(z) in the late Universe (conservatively redshifts \(z \lesssim 30\)) with a single fitting parameter, matter density today \(\Omega _m\).Footnote 1 Objectively, given the prevailing belief that \(\Omega _m \sim 0.3\), this marks billions of years of evolution with effectively no free parameters.

As originally pointed out [16] (see also [17]), within the FLRW framework, any mismatch between H(z), an unknown function inferred from Nature, and a theoretical assumption on the effective EoS \(w_{\text {eff}}(z)\), e.g. the \(\Lambda \)CDM model, must mathematically lead to a Hubble constant \(H_0\) that evolves with effective redshift. Simply put, a redshift-dependent \(H_0\) is indicative of a bad model [16]. This prediction can be tested in the late Universe, where the \(\Lambda \)CDM model reduces to two fitting parameters:

$$\begin{aligned} H(z) = H_0 \sqrt{1-\Omega _m + \Omega _m (1+z)^3}. \end{aligned}$$
(1)

To date, independent studies have documented decreasing \(H_0\) trends within model (1) across strong lensing time delay (SLTD) [18, 19], Type Ia supernovae (SNe) [20,21,22,23,24,25] and combinations of cosmological data sets [26,27,28]. Moreover, quasar (QSO) Hubble diagrams [29,30,31] show a preference for larger than expected \(\Omega _m\) values, \(\Omega _m \gtrsim 1\) [32,33,34,35]. It was subsequently noted that \(\Omega _m\) increases with effective redshift in SNe and QSO samples [24, 36, 37]. Although the trend in any given observable is not overly significant, e.g. \(\lesssim 2 \sigma \) for SLTD [18, 19], the significance increases quickly when probabilities from independent observables are combined using Fisher’s method [25].

Simple binned mock \(\Lambda \)CDM data analysis [25, 38] suggests that evolution of \((H_0, \Omega _m)\) best fit parameters must be expected in any data set that only provides either observational Hubble H(z) or angular diameter \( D_{A}(z)\) or luminosity distance \(D_{L}(z)\) constraints.Footnote 2 If true, one can expect to separate any given sample into low and high redshift subsamples and see discrepancies in the \((H_0, \Omega _m)\)-plane. Here, we highlight the feature in the latest Pantheon+ SNe sample [39, 40].Footnote 3 The main message of this letter is that evolution of \((H_0, \Omega _m)\) with effective redshift persists in Pantheon+ SNe. Furthermore, an increasing \(\Omega _m\) trend, evident at higher redshifts, continues beyond \(\Omega _m=1\) giving rise to negative DE densities at \(z \gtrsim 1\). In light of concerns highlighted in [45,46,47], the Pantheon+ sample improves on redshift corrections [48]. Thus, errors in the handling of redshifts can be precluded as the origin of the trend. It is worth stressing again that [25, 38] provide a mathematical proof that redshift evolution of best fit \(\Lambda \)CDM parameters cannot be ruled out in mock Planck-\(\Lambda \)CDM data. There are then two relevant questions. Is redshift evolution of best fit \(\Lambda \)CDM parameters evident in observed data? If so, what is its statistical significance?

In cosmology the default is to assess statistical significance with Markov Chain Monte Carlo (MCMC). The increasing \(\Omega _m\) trend is evident in MCMC posteriors, but as we demonstrate, the \(H_0\) posterior is subject to projection effects due to a degeneracy (banana-shaped contour) in the 2D \((H_0, \Omega _m)\) posterior. In the literature, this is interpreted as the data failing to constrain the model, but as we will show in Sect. 5, this is a misconception because it is not supported by the \(\chi ^2\) (see also [49]). We overcome the MCMC degeneracy in three complementary ways. First, we provide a Bayesian comparison between the \(\Lambda \)CDM model and the \(\Lambda \)CDM model with a split at redshift \(z_{\text {split}}\), where \((H_0, \Omega _m)\) are allowed to adopt different values at low and high redshift. Secondly, we employ a frequentist comparison between best fits of the observed data and mock data that focuses on different criteria quantifying evolution in the sample. Finally, we analyse the \(\chi ^2\) through profile distributions. For the Pantheon+ sample split at \(z_{\text {split}}=1\) we find a shift in the cosmological parameters that exceeds the 95% confidence level. Note, Pantheon+ is presented as a sample in the redshift range \(0 < z \le 2.26\), but redshift evolution of cosmological parameters is evident in the \(\Lambda \)CDM model from \(z = 0.7\) onwards.

Hints of negative DE densities, especially at higher redshifts, are widespread in the literature, so our observations in Pantheon+ may be unsurprising. Indeed, while \(\Lambda \)CDM mock analysis in [25, 38] confirms that \(\Omega _m > 1\) best fits are precluded with low redshift data, this is no longer true at higher redshifts. We stress again that this is a purely mathematical feature of the \(\Lambda \)CDM model. Claims of negative DE densities at higher redshifts have been noticeable since studies incorporating Lyman-\(\alpha \) BAO [50], one of the first observables discrepant with Planck-\(\Lambda \)CDM [51,52,53]; they include anti-de Sitter (AdS) vacua at high redshift [54, 55] (however see [56])Footnote 4 and features in data reconstructions [62,63,64,65,66].Footnote 5 This has led to extensive attempts to model negative DE densities [71,72,73,74,75,76,77,78,79,80,81], most simply as sign switching \(\Lambda \) models [82,83,84,85,86]. Given the sparseness of SNe data beyond \(z=1\), claims of negative DE densities are usually attributed to Lyman-\(\alpha \) BAO,Footnote 6 but here we see the same feature in state of the art Pantheon+ SNe. It is plausible that selection effects are at play (see discussion in [20]), but if the arguments in [25, 38] hold up, then \(\Omega _m > 1\) \(\Lambda \)CDM best fits to data in high redshift bins cannot be precluded. On the contrary, they can be expected.

2 Preliminaries

Our analysis starts by following and recovering results in [88] (see also [39]). We set the stage with a preliminary consistency check. In short, we extremise the likelihood,

$$\begin{aligned} \chi ^2 = {\vec {Q}}^{T} \cdot (C_{\text {stat+sys}})^{-1} \cdot {\vec {Q}}, \end{aligned}$$
(2)

where \({\vec {Q}}\) is a 1701-dimensional vector and \(C_{\text {stat+sys}}\) is the covariance matrix of the Pantheon+ sample [39]. The Pantheon+ sample has 1701 SN light curves, 77 of which correspond to galaxies hosting Cepheids in the low redshift range \(0.00122 \le z \le 0.01682\). In order to break the degeneracy between \(H_0\) and the absolute magnitude M of Type Ia SNe, we define the vector

$$\begin{aligned} Q_i = {\left\{ \begin{array}{ll} m_i -M - \mu _i, \quad i \in \text {Cepheid hosts} \\ m_i - M - \mu _{\text {model}}(z_i), \quad \text {otherwise} \end{array}\right. } \end{aligned}$$
(3)

where \(m_i\) and \(\mu _i \equiv m_i - M\) denote the apparent magnitude and distance modulus of the \(i^{\text {th}}\) SN, respectively. The cosmological model, which we assume to be the \(\Lambda \)CDM model (1), enters through the following relations:

$$\begin{aligned} \mu _{\text {model}}(z)&= 5 \log _{10} \frac{d_{L}(z)}{\text {Mpc}} + 25, \\ d_{L}(z)&= c (1+z) \int _{0}^{z} \frac{\text {d} z^{\prime }}{H(z^{\prime })}. \end{aligned}$$
(4)
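For concreteness, Eqs. (1) and (4) can be evaluated numerically in a few lines. The following is our own minimal sketch, not code from the Pantheon+ analysis; the function names are ours.

```python
import numpy as np
from scipy.integrate import quad

C_KM_S = 299792.458  # speed of light [km/s]

def H(z, H0, Om):
    """Flat LCDM Hubble parameter of Eq. (1), in km/s/Mpc."""
    return H0 * np.sqrt(1.0 - Om + Om * (1.0 + z) ** 3)

def d_L(z, H0, Om):
    """Luminosity distance of Eq. (4), in Mpc."""
    integral, _ = quad(lambda zp: 1.0 / H(zp, H0, Om), 0.0, z)
    return C_KM_S * (1.0 + z) * integral

def mu_model(z, H0, Om):
    """Distance modulus of Eq. (4)."""
    return 5.0 * np.log10(d_L(z, H0, Om)) + 25.0
```

At low redshift \(d_L(z) \approx cz/H_0\), which provides a quick sanity check of the integration.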

Extremising the likelihood, one arrives at the best fit values,

$$\begin{aligned} H_0 = 73.42 \text { km/s/Mpc}, \quad \Omega _m = 0.333, \quad M = -19.248, \end{aligned}$$
(5)

which are in perfect agreement with [88]. We estimate the \(1 \sigma \) confidence intervals through an MCMC exploration of the likelihood with emcee [89], finding excellent agreement with Fisher matrix analysis [88],

$$\begin{aligned} H_0&= 73.41^{+0.97}_{-1.00} \text { km/s/Mpc}, \\ \Omega _m&= 0.333^{+0.018}_{-0.017}, \quad M = - 19.248^{+0.028}_{-0.030}. \end{aligned}$$
(6)

It is interesting to compare Pantheon+ constraints on \(\Omega _m h^2\) \((h:= H_0/100)\) directly with Planck. In Fig. 1 we highlight a \(3.7 \sigma \) tension,Footnote 7 which importantly impacts the high redshift behaviour of \(H(z) \sim H_0 \sqrt{\Omega _m} (1+z)^{\frac{3}{2}}\) in the late Universe. This is interesting, as we start to see evolution in best fit \(\Lambda \)CDM parameters at higher redshifts. Given the tension in the Hubble constant [5,6,7,8,9], our focus here is on \(H_0\) and by extension \(\Omega _m\), since both parameters are correlated when one fits data. Of course, if the fitting parameters \(H_0\) and \(\Omega _m\) change with effective redshift, there is no guarantee that \(\Omega _m h^2\) is a constant. If the constancy of \(\Omega _m h^2\) can be tested, this allows one to study the assumption that matter is pressureless. Such studies will require data exclusively in the matter dominated regime where DE and radiation sectors are irrelevant. Given the sparsity of high redshift \(z > 1\) data, competitive studies are still a few years off.
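As a back-of-the-envelope check of the quoted tension, one can propagate the Pantheon+ errors in (6) into \(\Omega _m h^2\). The sketch below symmetrises the errors, assumes the Planck 2018 value \(\Omega _m h^2 = 0.1430 \pm 0.0011\) (our input, not quoted above), and ignores the \((H_0, \Omega _m)\) correlation, so it lands somewhat below the \(3.7 \sigma \) obtained from the full posteriors.

```python
import numpy as np

def omh2_tension(Om, sig_Om, h, sig_h, omh2_ref, sig_ref):
    """Naive tension in Omega_m h^2, ignoring the (H0, Om) correlation."""
    omh2 = Om * h**2
    # linear error propagation: d(Om h^2) = h^2 dOm + 2 Om h dh
    sig = np.hypot(h**2 * sig_Om, 2.0 * Om * h * sig_h)
    return abs(omh2 - omh2_ref) / np.hypot(sig, sig_ref)

# Pantheon+ best fit (6) with symmetrised errors; Planck 2018 value assumed
t = omh2_tension(0.333, 0.0175, 0.7341, 0.00985, 0.1430, 0.0011)
```

This crude estimate gives roughly \(3.4 \sigma \); including the correlation between \(H_0\) and \(\Omega _m\) sharpens it toward the quoted value.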

Fig. 1

\(3.7 \sigma \) tension between Planck and Pantheon+ for the parameter combination that dictates the high redshift behaviour of the Hubble parameter H(z) in the late Universe. We made use of GetDist [90]

3 Splitting Pantheon+

Having confirmed the results quoted in [88], we depart from earlier analysis and crop the Pantheon+ covariance matrix in order to isolate the \(77 \times 77\)-dimensional covariance matrix \(C_{\text {Cepheid}}\) corresponding to SNe in Cepheid host galaxies and define a new likelihood that is only sensitive to the absolute magnitude M,

$$\begin{aligned} \chi _{\text {Cepheid}}^2&= ({\vec {Q}}^{\prime })^{T} \cdot (C_{\text {Cepheid}})^{-1} \cdot {\vec {Q}}^{\prime }, \\ Q_i^{\prime }&= m_i - M - \mu _i, \quad i \in \text {Cepheid hosts}. \end{aligned}$$
(7)

We can now split the remaining 1624 SNe into low and high redshift samples, which we demarcate through a redshift \(z_{\text {split}}\). One can crop the original covariance matrix accordingly to get \(C_{\text {SN}}\) for either the low or high redshift sample, but we will primarily focus on the high redshift subsample with \(z > z_{\text {split}}\). The reason is that SNe samples have a low effective redshift, \(z_{\text {eff}} \sim 0.3\), and it is well documented that Planck values \(\Omega _m \sim 0.3\) are preferred. The hypothesis we explore is that such results overlook evolution at higher redshifts, so this explains the focus on high redshift subsamples. In summary, we study the new likelihood,

$$\begin{aligned} \chi ^2 = \chi ^2_{\text {Cepheid}} + \chi _{\text {SN}}^2, \end{aligned}$$
(8)

where we have defined,

$$\begin{aligned} \chi _{\text {SN}}^2&= ({\tilde{Q}})^{T} \cdot (C_{\text {SN}})^{-1} \cdot {\tilde{Q}}, \\ \tilde{Q}_i&= m_i - M - \mu _{\text {model}}(z_i). \end{aligned}$$
(9)

The redshift range of the Pantheon+ sample [39] is \(0.00122 \le z \le 2.26137\), so we take \(z_{\text {split}}\) in this range. In the next section we begin the tomographic analysis of splitting the Pantheon+ sample into a low and high redshift subsample. We remark that the likelihoods presented in Eqs. (2) and (8) omit a constant normalisation. Being a constant, it plays no role when one fits data, and is thus routinely omitted in the literature [39]. However, this term is relevant when one performs Bayesian model comparison. We will reinstate the normalisation later.
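In practice, the split amounts to cropping the data vector and covariance matrix with boolean masks. A minimal sketch with our own helper, not code from the Pantheon+ release:

```python
import numpy as np

def split_sample(z, m, C, z_split, cepheid_mask):
    """Crop data and covariance into Cepheid-host, low-z and high-z pieces.

    z, m         : redshifts and apparent magnitudes (length N)
    C            : N x N stat+sys covariance matrix
    cepheid_mask : boolean mask flagging SNe in Cepheid host galaxies
    """
    lo = (~cepheid_mask) & (z <= z_split)
    hi = (~cepheid_mask) & (z > z_split)

    def crop(mask):
        # np.ix_ keeps the rows *and* columns selected by the mask
        return z[mask], m[mask], C[np.ix_(mask, mask)]

    return crop(cepheid_mask), crop(lo), crop(hi)
```

Dropping the off-diagonal blocks between the pieces is exactly the truncation discussed around Eqs. (7)–(9); Sect. 4.5 revisits this choice.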

4 Analysis

It is widely recognised that confronting exclusively high redshift SNe data to the \(\Lambda \)CDM model, MCMC inferences are typically impacted by degeneracies, i. e. banana-shaped posteriors, in the \((H_0, \Omega _m)\)-plane. Later we confirm the impact of projection effects on MCMC posteriors as priors are relaxed in the presence of a degeneracy.Footnote 8 We overcome the degeneracy in MCMC marginalisation through three different prongs of attack that only rest upon on the likelihood or \(\chi ^2\). Here it is worth noting that MCMC is merely an algorithm, whereas the \(\chi ^2\) is a measure or metric of how well a point in parameter space fits the data. First, we provide a Bayesian model comparison based on the Akaike Information Criterion (AIC) [91] between the \(\Lambda \)CDM model and a \(\Lambda \)CDM model allowing a jump in cosmological parameters \((H_0, \Omega _m)\) at a fixed redshift. Despite the vanilla \(\Lambda \)CDM model being preferred by the AIC, the analysis demonstrates that an alternative model, even a physically ad hoc model that contradicts the basic fundamentals of FLRW, becomes more competitive if the \(\Lambda \)CDM fitting parameters change with effective redshift when confronted to data. Secondly, in a frequentist analysis we resort to a comparison between best fits of observed and mock data in the same redshift range with the same data quality to ascertain the significance of evolution. Finally, later in section 5, we employ profile distributions as a secondary frequentist approach.

4.1 Bayesian interpretation

One may interpret the results in Table 1 as a comparison between two models. The first is the \(\Lambda \)CDM model fitted over the entire redshift range of the SNe, \(0.00122 \le z \le 2.26137\), with three parameters \((H_0, \Omega _m, M)\), while the second is the \(\Lambda \)CDM model with a split at redshift \(z_{\text {split}}\) allowing the model to adopt different values of \((H_0, \Omega _m)\) above and below the split. Note that the likelihood (8) separates SNe in Cepheid host galaxies and their only role is to constrain M. For this reason, one is only fitting two effective parameters \((H_0, \Omega _m)\). Furthermore, by introducing the data split, we are comparing this effective two parameter model \((H_0, \Omega _m)\) with an effective five parameter model \((H^{(1)}_0, \Omega ^{(1)}_m, H^{(2)}_0, \Omega ^{(2)}_m, z_{\text {split}})\). Table 1 presents improvements in the \(\chi ^2\) without the normalisation corresponding to the logarithm of the determinant of the covariance matrix \(C_{\text {stat+sys}}\). Since we truncate out \(C_{\text {stat+sys}}\) entries when we split the SNe, this increases the normalisation, thereby penalising the model with the split beyond the 3 extra parameters introduced. We will quantify this number in turn, but only in competitive settings relative to the \(\Lambda \)CDM model where the improvement in \(\chi ^2\) in Table 1 is enough to overcome the additional parameters, i.e. \(\Delta \chi ^2 < -6\).Footnote 9

Table 1 Redshift splits of the Pantheon+ sample showing the number of SNe (excluding 77 calibrators), the best fit \(H_0\) and \(\Omega _m\) values, and differences in \(\chi ^2\) in low and high redshift subsamples. Changes in \(\chi ^2\) are with respect to best fits for the full sample with no split (see Table 2). M does not change as we decouple the calibrating SNe in the likelihood (8)

It should be noted that while model A is the vanilla \(\Lambda \)CDM model, the model B that serves as a foil to \(\Lambda \)CDM is a contradiction, because if \(H_0\) and \(\Omega _m\) change with effective redshift, this violates the mathematical requirement that both are integration constants. For this reason, model B could never replace \(\Lambda \)CDM. Nevertheless, the result is instructive as Bayesian model comparison is prevalent in the cosmology literature. That being said, the focus of this paper is performing a consistency check of the \(\Lambda \)CDM model and this does not necessitate a model B. What the analysis here shows is that we are getting close to a point in time where models incorporating evolution in the fitting parameters \(H_0\) and \(\Omega _m\) may be more competitive than \(\Lambda \)CDM, simply based on SNe data alone.

We recall the Akaike information criterion (AIC) [91],

$$\begin{aligned} \text {AIC} = -2 \ln \mathcal {L}_{\text {max}} + 2 d = \ln |C_{\text {stat+sys}}| + \chi _{\text {min}}^2 + 2 d, \end{aligned}$$
(10)

where \(\chi _{\text {min}}^2\) is the minimum of the \(\chi ^2\), d is the number of free parameters and \(|C_{\text {stat+sys}}|\) denotes the determinant of the Pantheon+ covariance matrix \(C_{\text {stat+sys}}\). Since the latter is a constant, it has no bearing on the best fit parameters, but it impacts the AIC analysis.Footnote 10 However, since \(C_{\text {stat+sys}}\) is a large matrix with small numerical entries, determining the absolute value of \(|C_{\text {stat+sys}}|\) within machine precision is difficult. One can simplify the problem by noting that the \(77 \times 77\) matrix \(C_{\text {Cepheid}}\) is common to both the \(\Lambda \)CDM model and the \(\Lambda \)CDM model with a jump in cosmological parameters, so it contributes to both AIC values and drops out. Thus, we only need to study the \(1624 \times 1624\) covariance matrix \(C_{\text {SN}}\), but this is still a large matrix with small numerical entries.

Since the \(\Lambda \)CDM model with a jump in cosmological parameters necessitates three additional parameters, i.e. \(\Delta d = 3\), this penalty can only be absorbed to give a lower AIC if \(\Delta \chi _{\text {min}}^2 < -6\). The results of splitting the Pantheon+ sample and fitting the \(\Lambda \)CDM model to data below and above \(z = z_{\text {split}}\) are shown in Table 1. We find that refitting the low redshift sample typically leads to small improvements in \(\chi ^2\), whereas refits of the high redshift sample lead to greater improvements. This outcome is expected if there is evolution across the sample; the evolution is only expected at higher redshifts because SNe samples have a low effective redshift, and as we have noted, SNe samples generically prefer Planck values \(\Omega _m \sim 0.3\). In particular, \(z_{\text {split}}=1\) gives rise to the greatest reduction in \(\chi ^2_{\text {min}}\) with respect to the \(\Lambda \)CDM model without the split. However, we need to make sure that differences in \(\ln |C_{\text {SN}}|\) do not counter the improvement in \(\chi _{\text {min}}^2\).

To that end, consider

$$\begin{aligned} C_{\text {SN}} = \left( \begin{array}{cc} A &{} B \\ B^{T} &{} C \end{array} \right) , \end{aligned}$$
(11)

where A, B and C are respectively \(1599 \times 1599\), \(1599 \times 25\) and \(25 \times 25\)-dimensional matrices. Note that the dimensionalities are fixed by the choice of \(z_{\text {split}}=1\). The determinant of this block matrix is

$$\begin{aligned} |C_{\text {SN}}| = |A| \cdot |C-B^{T} A^{-1} B|, \end{aligned}$$
(12)

provided the matrix A is invertible. Note that when one introduces the split at \(z_{\text {split}}=1\), one sets \(B = 0\). As a result, the difference in \(\ln |C_{\text {SN}}|\) is

$$\begin{aligned} \Delta \ln |C_{\text {stat+sys}}|&= \Delta \ln |C_{\text {SN}}| \nonumber \\&= \ln |C| - \ln |C-B^{T} A^{-1} B| \nonumber \\&= -71.49-(-72.97) = 1.48, \end{aligned}$$
(13)

where \(\ln |A|\) contributes equally to competing AIC values and thus drops out. This removes the problem with machine precision, leaving us with a comparison of the logarithms of the determinants of smaller \(25 \times 25\) matrices. We are now left with an easy calculation. The AIC changes by \(\Delta \text {AIC} = \Delta \ln |C_{\text {stat+sys}}|+\Delta \chi ^2_{\text {min}} + 2 \Delta d = 1.5-1-6.2 +2 (5-2) = 0.3\) when one replaces the vanilla \(\Lambda \)CDM model with a (contradictory) \(\Lambda \)CDM model with a jump in the parameters \((H_0, \Omega _m)\) at \( z_{\text {split}} = 1\). Thus, despite the evolution seen in \((H_0, \Omega _m)\), Pantheon+ SNe data still have a marginal preference for the vanilla \(\Lambda \)CDM model over a physically ad hoc model with 3 additional parameters.
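The Schur complement identity (12) and the difference (13) are easy to verify numerically. The sketch below uses toy matrix sizes in place of 1599 and 25; `numpy.linalg.slogdet` returns the log-determinant directly, sidestepping exactly the machine-precision issue noted above.

```python
import numpy as np

rng = np.random.default_rng(0)

# build a random symmetric positive definite "covariance" and partition it
n, k = 8, 3  # toy sizes standing in for 1599 and 25
M = rng.normal(size=(n + k, n + k))
C_SN = M @ M.T + (n + k) * np.eye(n + k)
A, B, C = C_SN[:n, :n], C_SN[:n, n:], C_SN[n:, n:]

# Eq. (12): log|C_SN| = log|A| + log|C - B^T A^-1 B|
_, logdet_full = np.linalg.slogdet(C_SN)
_, logdet_A = np.linalg.slogdet(A)
_, logdet_schur = np.linalg.slogdet(C - B.T @ np.linalg.solve(A, B))

# Eq. (13): setting B = 0 changes the log-determinant by log|C| - log|C - B^T A^-1 B|
_, logdet_C = np.linalg.slogdet(C)
delta_logdet = logdet_C - logdet_schur
```

Since \(B^{T} A^{-1} B\) is positive semi-definite, the difference (13) is non-negative, i.e. the truncation always penalises the split model.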

Given that our model B is not only a contradiction, but also has 3 additional parameters, it is not really a serious contender. That being said, the take-home message is clear. If the \(\Lambda \)CDM fitting parameters \((H_0, \Omega _m)\) change with effective redshift in a statistically significant way (see later analysis in Sect. 5 for confirmation), thereby failing our consistency check for a given split into low and high redshift subsamples, this opens the door to competing models. A physically motivated minimal extension of the \(\Lambda \)CDM model may evidently lead to a reversal of the conclusion that the \(\Lambda \)CDM model is preferred.

We close this section with additional comments. The change of \((H_0, \Omega _m)\) with effective redshift constitutes a decreasing \(H_0\)/increasing \(\Omega _m\) best fit trend. This is consistent with earlier analysis of the Pantheon SNe sample [24, 25]. Moreover, as is clear from Fig. 2 of [24] and Table 1, this trend begins at \(z = 0.7\). Upgrading the Pantheon to the Pantheon+ sample has not changed this trend. A final point worth stressing is that best fits beyond \(z_{\text {split}} = 1\) prefer a \(\Lambda \)CDM model with negative DE densities, \(\Omega _m > 1\). This is simply a feature of the Pantheon+ [39, 40] data set, but since Risaliti-Lusso QSOs [30, 31] have a strong preference for \(\Omega _m > 1\) inferences in the \(\Lambda \)CDM model at high redshifts, the observations are consistent and both data sets warrant further study.

4.2 An illustration of MCMC bias

Having identified the split that enhances the improvement in fit, here we fix \(z_{\text {split}}=1\) and present MCMC posteriors for data above and below the split. In Fig. 2 the results of this exercise can be seen, where we have allowed for different uniform priors on \(\Omega _m\). There are a number of take-home messages. First, the low redshift (\(H_0, \Omega _m\)) posteriors are Gaussian, as expected, whereas the high redshift (\(H_0, \Omega _m\)) posteriors are not. Secondly, the peak of the \(\Omega _m\) posterior is found in the \(\Omega _m > 1\) regime, but it is robust to changes in the \(\Omega _m\) prior. Thus, imposing \(\Omega _m \le 1\) would simply cut off the peak in the high redshift \(\Omega _m\) posterior. Thirdly, the \(H_0\) posterior is sensitive to the \(\Omega _m\) prior. This is easy to understand as a projection effect. In short, as we relax the prior, the 2D MCMC posterior probes more of the top left corner of the \((H_0, \Omega _m)\)-plane. Configurations in this corner only differ appreciably in \(\Omega _m\), while getting projected onto more or less the same lower value of \(H_0\). Ultimately, what one concludes from the 2D MCMC posteriors is that the data is not good enough to constrain the model. The assumption then is that points in parameter space along the banana-shaped contour give rise to more or less the same values of \(\chi ^2\). As we shall show later, this assumption is false (see also [49]). Of course, the peak of the marginalised 1D \(H_0\) posterior cannot be tracking the minimum of the \(\chi ^2\) as its value is unique up to machine precision. We will now introduce two independent methodologies, mock simulations and profile distributions, which track the minimum of the \(\chi ^2\), and we will assess the statistical significance of evolution between low and high redshift subsamples.

Fig. 2

MCMC posteriors for low and high redshift subsamples for the 2 cosmological parameters \((H_0, \Omega _m)\) and the 1 nuisance parameter M, the absolute magnitude of Type Ia SN. The low redshift posteriors are Gaussian, but the high redshift posteriors are not, in line with expectations. Extending the uniform \(\Omega _m\) prior leads to shifts in the peak of the \(H_0\) posterior due to a projection effect. Imposing the standard \(\Omega _m \le 1\) prior cuts off the peak of the \(\Omega _m\) distribution in the high redshift subsample

4.3 Frequentist interpretation

Here we adopt the same likelihood (8), but estimate the probability of finding in mock data a decreasing \(H_0\)/increasing \(\Omega _m\) best fit trend and negative DE densities as prominent as those in the observed data. It should be noted that whenever one finds an unusual signal in cosmological data, it is standard practice to run mock simulations to ascertain whether the signal is statistically significant or not. Here, the decreasing \(H_0\)/increasing \(\Omega _m\) trend in best fits is the unusual signal that we wish to test. Since we search for evolution trends and one expects little evolution at low z in mocks with good statistics, it is more efficient to remove low redshift SNe and restrict attention to the 210 SNe in the redshift range \(z > 0.5\). Thus, given a realisation of SNe data, we choose a cut-off redshift \(z_{\text {cut-off}}\) in the range \(z_{\text {cut-off}} \in \{0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.1, 1.2 \}\) and remove SNe with \(z \le z_{\text {cut-off}}\). This gives us 8 nested subsamples and for each subsample, we fit the \(\Lambda \)CDM model and record the best fit \((H_0, \Omega _m)\) values. We then construct the sums

$$\begin{aligned} \sigma _{H_0} = \sum _{z_{\text {cut-off}}} ( H_0 - 73.41), \quad \sigma _{\Omega _m} = \sum _{z_{\text {cut-off}}} ( \Omega _m - 0.333), \end{aligned}$$
(14)

where \(H_0\) and \(\Omega _m\) denote the best fits at each \(z_{\text {cut-off}}\), and the difference is relative to the best fits of the full sample (Table 2). See [24] for earlier analysis with the Pantheon sample, where similar sums were employed but with a fixed (not fitted) M. Sums close to zero correspond to realisations of the data with no specific trend, where deviations average to zero. As is clear from Table 1, in Pantheon+ we see a decreasing \(H_0\) and increasing \(\Omega _m\) trend, so we expect \(\sigma _{H_0} < 0\) and \(\sigma _{\Omega _m} > 0\) in Pantheon+ SNe; the concrete numbers are \(\sigma _{H_0} = -115.50\) and \(\sigma _{\Omega _m} = 9.27\) to two decimal places. The advantage of constructing a sum is that it places no particular importance on the choice of \(z_{\text {split}}\).
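The sums (14) are straightforward to compute once the eight nested best fits are in hand; a minimal sketch with our own function name:

```python
import numpy as np

def evolution_sums(best_fits, H0_full=73.41, Om_full=0.333):
    """Sums of Eq. (14) over nested z > z_cutoff subsample best fits.

    best_fits : sequence of (H0, Om) best fit pairs, one per z_cutoff
                in {0.5, 0.6, ..., 1.2}
    """
    H0s, Oms = np.asarray(best_fits, dtype=float).T
    return np.sum(H0s - H0_full), np.sum(Oms - Om_full)
```

A realisation with no trend averages both sums to zero, while a decreasing \(H_0\)/increasing \(\Omega _m\) trend gives \(\sigma _{H_0} < 0\) and \(\sigma _{\Omega _m} > 0\).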

Table 2 Input parameters for our mocks. We construct an array of \((H_0, \Omega _m, M)\) values randomly in a normal distribution about each best fit with standard deviation specified by the error. Errors have been estimated through Fisher matrix

Our goal now is to construct Pantheon+ SNe mocks in the redshift range \(z > 0.5\) that are statistically consistent with no evolution in \((H_0, \Omega _m)\). To begin, we fit the full sample using the likelihood (8) and identify best fits and \(1 \sigma \) confidence intervals through the inverse of a Fisher matrix (see [88]). We record the result in Table 2, noting that the result agrees almost exactly with (6), despite differences in the likelihood, i.e. (2) versus (8). Note, we could also run an MCMC chain, but we have already demonstrated that the errors are Gaussian in Sect. 2, so whether one uses an MCMC chain or random numbers generated in normal distributions from Table 2, one does not expect a great difference.

We next generate an array of 3000 \((H_0, \Omega _m, M)\) values randomly in normal distributions with central values corresponding to the best fit values and \(1 \sigma \) corresponding to the errors in Table 2. One could alternatively fix the injected cosmological parameters to the best fits in Table 2, but our approach here allows for greater randomness. For each entry in this array, we construct \(m_i = \mu _{\text {model}} (H_0, \Omega _m, z_i)+M\) for the 210 SNe in the redshift range \(0.5 < z \le 2.26137\). We then generate 210 new values of the apparent magnitude \(m_i\) by drawing from a random multivariate normal with the covariance matrix \(C_{\text {SN}}\) in (9), truncated from the Pantheon+ covariance matrix \(C_{\text {stat+sys}}\). This gives us one mock realisation of the data for each entry in our \((H_0, \Omega _m, M)\) array, which we fit back to the \(\Lambda \)CDM model for the nested subsamples in order to identify best fit parameters and the sums (14). Note, our mocking procedure drops correlations between \((H_0, \Omega _m, M)\), but this is not expected to make a big difference since, as can be seen from the yellow contour in Fig. 2, which is representative of the full sample, none of the parameters are strongly correlated. Moreover, we do not generate new SNe in Cepheid hosts, so M and its constraints are the same in mock and real data. This is justifiable because M should be insensitive to cosmologyFootnote 11 and our focus here is on studying evolution of \((H_0, \Omega _m)\) best fits in high redshift cosmological data. Once this is done for all 3000 realisations, we count the number of mock realisations that give both \(\sigma _{H_0} \le -115.50\) and \(\sigma _{\Omega _{m}} \ge 9.27\). Essentially, by ranking the mocks by \(\sigma _{H_0}\) and \(\sigma _{\Omega _m}\), one can assign a percentile or probability to the observed Pantheon+ sample, just as one would do with the heights of children in a class.
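A single step of this mocking procedure can be sketched as follows. The helper and its signature are ours; `mu_model` stands for the distance modulus of Eq. (4) and is passed in so the sketch stays self-contained.

```python
import numpy as np

def make_mock(z, C_SN, H0, Om, M, mu_model, rng):
    """One mock realisation of apparent magnitudes.

    Draws m_i from a multivariate normal with mean mu_model(z_i) + M and
    covariance C_SN (the truncated Pantheon+ covariance matrix).
    """
    mean = np.array([mu_model(zi, H0, Om) for zi in z]) + M
    return rng.multivariate_normal(mean, C_SN)
```

Each mock is then refit to the \(\Lambda \)CDM model over the nested subsamples, and the sums (14) are recorded.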
Note, in both these exercises it is unimportant what the probability density function (PDF) looks like; what matters is simply whether numbers are smaller or larger than a given value. In Fig. 3 we show the result of this exercise. As expected, our mock PDFs are peaked at \(\sigma _{H_0} = \sigma _{\Omega _m} = 0\). From 3000 mocks, we find 240 with more extreme values than those we find in Pantheon+ SNe. This gives us a p-value of \(p = 0.08\) (\(1.4 \sigma \) for a one-sided normal).
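Converting such mock counts into p-values and one-sided Gaussian significances is a one-liner with `scipy.stats.norm`; a sketch:

```python
from scipy.stats import norm

def pvalue_to_sigma(n_extreme, n_mocks):
    """Mock count -> (p-value, one-sided Gaussian significance)."""
    p = n_extreme / n_mocks
    # isf is the inverse survival function: the z-score with upper tail p
    return p, norm.isf(p)

# the Pantheon+ sums above: 240 of 3000 mocks are more extreme
p, sig = pvalue_to_sigma(240, 3000)
```

The same conversion applies to the negative DE density counts quoted below (298/3000 and 77/3000).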

Fig. 3

Sums (14) from 3000 mocks where the input parameters were picked in normal distributions consistent with Table 2. Red lines denote the corresponding values from Pantheon+

We next consider the likelihood of finding negative DE densities (\(\Omega _m > 1\)), as well as the likelihood of finding \(\Omega _m\) best fits as large as in the Pantheon+ sample, \( \Omega _m \gtrsim 3\), in the three final entries in Table 1. This can be done by recording best fit \(\Omega _m\) values from mocks with \(z_{\text {cut-off}} \in \{1.0, 1.1, 1.2\}\). From 3000 mocks, we find 298 that maintain \(\Omega _m > 1\) best fits and 77 that maintain larger \(\Omega _m\) best fits than the Pantheon+ sample. This gives us probabilities of \(p=0.1\) (\(1.3 \sigma \)) and \(p=0.026\) (\(1.9 \sigma \)), respectively. In other words, we find negative DE densities in the same redshift range in one mock in 10 and larger \(\Omega _m\) best fits in one mock in 38. In Fig. 4 we show a subsample of the mock best fits.

Fig. 4

A sample of 300 mock best fits for SNe with \(z > z_{\text {cut-off}} \in \{1.0, 1.1, 1.2\}\). 6 mocks (blue) return \(\Omega _m\) best fits that remain above the \(\Omega _m\) best fits in Pantheon+ (solid black), 25 (green) remain above \(\Omega _m = 1\), and 269 (red) record best fits below \(\Omega _m = 1\). We impose the bound \(0 \le \Omega _m \le 5\) and some points saturate these bounds

4.4 Pantheon+ covariance matrix and \(z_{\text {split}} = 1\)

Here we comment on how representative the high redshift best fit \((H_0, \Omega _m)\) values are if one splits the sample at \(z_{\text {split}}=1\). In contrast to the earlier sums, this means that we have singled out a particular redshift by hand and we are assessing the probability of a more specific event; for this reason we expect a probability less than \(p = 0.08\). We perform this particular analysis so that we can directly compare to profile distributions in the next section. However, in the process we find a secondary result on the covariance matrix that is worth commenting upon. Once again we perform mock analysis, but surprisingly find that none of the best fits to 10,000 mocks fit the data as well as the best fit to the real data. This may be partly due to the difference in preferred cosmological parameters, but is also expected to be due to a potential overestimation of the Pantheon+ covariance matrix [92].

Fig. 5

Distribution of \(\chi ^2\) from 10,000 mocks of \(z > 1\) SNe data, where input parameters are drawn from normal distributions consistent with Table 2. The red line corresponds to the value in the Pantheon+ sample. None of our mock data result in smaller \(\chi ^2\) values than the real data

Concretely, we construct an array of 10,000 \((H_0, \Omega _m, M)\) mock input parameters by employing the best fits and \(1 \sigma \) confidence intervals in Table 2 as central values and standard deviations for normal distributions. For each entry in this array, we generate a new mock copy of the 25 data points in the Pantheon+ sample above \(z=1\), which we then fit back to the model, recording 10,000 \((H_0, \Omega _m, M)\) best fits. Note, we are once again constructing high redshift subsamples that are representative of the full sample by construction. In Fig. 5 we show a comparison of the \(\chi ^2\) from mock data (blue PDF) versus the \(\chi ^2\) from real data (red line); none of our mocks lead to lower values of \(\chi ^2\). Nevertheless, if one focuses on best fits, we find both smaller values of \(H_0\) and larger values of \(\Omega _m\) in 375 cases from 10,000 simulations, giving us a probability of \(p = 0.0375\) of finding more extreme best fits. This corresponds to \( 1.8 \sigma \) for a one-sided normal. In the next section we will compare this statistical significance to profile distributions with the same data in the same redshift range. The key point here is that profile distributions provide a consistency check of our mock simulations. In short, if our mocks are trustworthy, we expect to see a \(\sim 1.8 \sigma \) discrepancy in an independent profile distribution analysis. A secondary point is that more extreme values of the \(\chi ^2\) are not found, which supports the observation in [92] that the Pantheon+ covariance matrix is overestimated.
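The generate-and-refit step can be sketched as follows. This is a stripped-down, noise-free toy, assuming a simple trapezoidal integration for the luminosity distance and a coarse grid search in place of the full covariance-weighted optimisation; the input values, grid and 0.1 mag diagonal errors are illustrative only. In the actual analysis, mock data carry noise consistent with the covariance matrix, so recovery is not exact as it is here.

```python
import math

C_KMS = 299792.458  # speed of light [km/s]

def efunc(z, om):
    """Dimensionless Hubble rate E(z) = H(z)/H0 in flat LambdaCDM, Eq. (1)."""
    return math.sqrt(1.0 - om + om * (1.0 + z) ** 3)

def mu_model(z, h0, om, steps=200):
    """Distance modulus from a trapezoidal comoving-distance integral."""
    dz = z / steps
    integral = 0.5 * dz * (1.0 + 1.0 / efunc(z, om))
    for i in range(1, steps):
        integral += dz / efunc(i * dz, om)
    d_l = (1.0 + z) * (C_KMS / h0) * integral  # luminosity distance [Mpc]
    return 5.0 * math.log10(d_l) + 25.0

# noise-free "data": 25 high redshift points, as in the z > 1 subsample
h0_true, om_true = 70.0, 0.35
zs = [1.0 + 1.3 * i / 24 for i in range(25)]
data = [mu_model(z, h0_true, om_true) for z in zs]

# refit by brute-force grid search, assuming 0.1 mag diagonal errors
best_chi2, best_h0, best_om = float("inf"), None, None
for h0 in [60.0 + 1.0 * i for i in range(21)]:
    for om in [round(0.05 * j, 2) for j in range(21)]:
        chi2 = sum(((mu_model(z, h0, om) - d) / 0.1) ** 2
                   for z, d in zip(zs, data))
        if chi2 < best_chi2:
            best_chi2, best_h0, best_om = chi2, h0, om
print(best_h0, best_om)  # the noise-free grid fit recovers 70.0 0.35
```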

4.5 Restoring the covariance matrix

Earlier we truncated out an off-diagonal block from the Pantheon+ covariance matrix in likelihood (2) in order to decouple 77 SNe in Cepheid hosts from the remaining 1624 SNe and thus define the new likelihood (8). Since this is heavy-handed if one only wants to focus on high redshift SNe, here we restore the off-diagonal entries in the covariance matrix. The results are shown in Table 3, where it is evident that the decreasing \(H_0\)/increasing \(\Omega _m\) best fit trend with effective redshift is robust beyond \(z_{\text {split}} = 0.7\). Moreover, we now find that SNe beyond \(z_{\text {split}} = 0.9\) return best fits consistent with negative DE density. We have relaxed the bounds on \(\Omega _m\) in order to accommodate best fits that saturate the bounds, and the number of SNe excludes the 77 calibrating SNe. We also record a reduction in \(\chi ^2\) relative to the best fit values in Table 1, the difference being that here we use (a truncation of) likelihood (2) and not likelihood (8). This provides a sanity check that our best fits are finding new minima as we change the likelihood. Evidently, the re-introduction of off-diagonal entries in the covariance matrix impacts best fits, but not the features of interest. As noted in the previous section, the Pantheon+ covariance matrix appears overestimated [92].
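Schematically, truncating the likelihood to a subsample amounts to selecting rows and columns of \(C_{\text {stat+sys}}\), which keeps the off-diagonal correlations among the retained SNe. A minimal sketch with a hypothetical \(4 \times 4\) toy matrix:

```python
def truncate_covariance(cov, keep):
    """Restrict a covariance matrix to the rows and columns indexed by
    `keep`, preserving off-diagonal correlations among the kept SNe."""
    return [[cov[i][j] for j in keep] for i in keep]

# toy 4 x 4 covariance: entries are illustrative, not Pantheon+ values
cov = [[1.0, 0.2, 0.1, 0.0],
       [0.2, 1.0, 0.0, 0.1],
       [0.1, 0.0, 1.0, 0.3],
       [0.0, 0.1, 0.3, 1.0]]

# keep SNe 0, 2 and 3 (e.g. a calibrator plus two high redshift SNe)
sub = truncate_covariance(cov, [0, 2, 3])
print(sub)  # [[1.0, 0.1, 0.0], [0.1, 1.0, 0.3], [0.0, 0.3, 1.0]]
```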

Table 3 A repeat of the analysis in Table 1 that includes additional off-diagonal terms in \(C_{\text {stat+sys}}\) by truncating the general likelihood (2) to only the 77 SNe in Cepheid host galaxies and SNe with redshift \(z > z_{\text {split}}\). \(\Delta \chi ^2\) is the difference relative to the best fit values in Table 1, where the likelihood (8) was employed. We relaxed the bound \(\Omega _m \le 1\) to accommodate the best fit values

5 Profile distributions

In this section we follow the methodology in [93], more specifically [49], to which we refer the reader for further details. As explained in [38], removing low redshift H(z), \(D_{L}(z)\) or \(D_{A}(z)\) data pushes the (flat) \(\Lambda \)CDM model into a non-Gaussian regime where projection effects are unavoidable. If one wants to test the constancy of \(\Lambda \)CDM cosmological parameters in the late Universe, and not simply resort to adopting a working assumption, then one has to overcome these effects. Profile distributions [93] allow one to construct probability density functions that properly track the minimum of the \(\chi ^2\). The latter is by definition the point in model parameter space that best fits the data. As is clear from Fig. 2, where there is a degeneracy (banana-shaped contour) in the \((H_0, \Omega _m)\)-plane, the peak of the \(H_0\) posterior is sensitive to the prior, so it evidently tells one very little about the point in parameter space that best fits the data. Note that profile distributions [93] are simply a variant of profile likelihoods (see section 4 of Ref. [94]), where instead of optimising one recycles the MCMC chain. As a result, the input for both the Bayesian and frequentist analysis is the information in the MCMC chain, thereby allowing a more direct comparison between the two approaches.

Here we focus on \(z_{\text {split}}=1\), as both our Bayesian and frequentist mock analyses suggest that this is the redshift split where evolution is most significant. Note, one can of course find sample splits with less evolution, but if one is interested in the self-consistency of a data set within the context of the \(\Lambda \)CDM model, it behoves us to focus on the most extreme cases. Following [49, 93] we fix a generous uniform prior \(\Omega _{m} \in [0, 8]\) and run a long MCMC chain for SNe with \(z > z_{\text {split}} = 1\). The prior has been chosen large enough that the expected best fit \(\Omega _m \sim 3.4\) from Table 1 (\(z_{\text {split}} = 1\) row) can be recovered from the resulting distribution. We identify the minimum of the \(\chi ^2\), \(\chi ^2_{\text {min}}\), from the full MCMC chain. Next we break up the \(H_0\) and \(\Omega _m\) ranges into bins and record the lowest value of the \(\chi ^2\) in each bin, which gives us \(\chi ^2_{\text {min}}(H_0)\) and \(\chi ^2_{\text {min}}(\Omega _m)\), respectively. We can then define \(\Delta \chi ^2_{\text {min}}(H_0):= \chi ^{2}_{\text {min}}(H_0) - \chi ^2_{\text {min}}\) for \(H_0\) and an analogous \(\Delta \chi ^2_{\text {min}}(\Omega _m)\) for \(\Omega _m\). We next construct the distributions \(R(H_0) = e^{-\frac{1}{2} \Delta \chi ^2_{\text {min}}(H_0)}\) and \(R(\Omega _m) = e^{-\frac{1}{2} \Delta \chi ^2_{\text {min}}(\Omega _m)}\), which by construction peak at \(R(H_0) = R(\Omega _m) = 1\) in the bin containing the overall minimum of the \(\chi ^2\) for the full MCMC chain.
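The binning procedure just described can be sketched in a few lines of Python. The quadratic toy "chain" below stands in for a real MCMC chain and is purely illustrative; only the binning and \(R\) construction mirror the text:

```python
import math
import random

def profile_distribution(chain, n_bins=200):
    """Bin parameter samples, track the minimum chi^2 per bin and return
    (bin centre, R = exp(-0.5 * Delta chi^2_min)) pairs; empty bins are
    simply omitted."""
    lo = min(p for p, _ in chain)
    hi = max(p for p, _ in chain)
    width = (hi - lo) / n_bins
    chi2_min_bin = {}
    for p, chi2 in chain:
        b = min(int((p - lo) / width), n_bins - 1)
        chi2_min_bin[b] = min(chi2, chi2_min_bin.get(b, math.inf))
    chi2_min = min(chi2_min_bin.values())  # global minimum over the chain
    return [(lo + (b + 0.5) * width, math.exp(-0.5 * (c - chi2_min)))
            for b, c in sorted(chi2_min_bin.items())]

# toy "chain": H0 samples with a quadratic chi^2 plus random scatter
random.seed(1)
chain = [(h0, (h0 - 70.0) ** 2 / 4.0 + random.random())
         for h0 in (random.uniform(50.0, 90.0) for _ in range(20000))]
profile = profile_distribution(chain)
peak_h0, peak_r = max(profile, key=lambda t: t[1])
print(round(peak_r, 3))  # 1.0 by construction in the best fit bin
```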

Fig. 6

\(R(H_0)\) and \(R(\Omega _m)\) distributions for high redshift \(z > 1\) SNe as a function of \(H_0\) and \(\Omega _m\). The black lines are the best fit values of the full Pantheon+ sample. Dashed, dotted and dashed-dotted lines denote \(1 \sigma , 2 \sigma \) and \(3 \sigma \), respectively

It should be stressed that it is easy to select priors for \(H_0\) large enough that \(R(H_0)\) decays to zero within them. Nevertheless, as is clear from Fig. 2, \(\Omega _m\) distributions become broad in high redshift bins and the fall off may be extremely gradual. However, once one switches from MCMC posteriors to profile distributions, we are no longer worried about the volume of parameter space explored in MCMC marginalisation, but simply that each bin is populated and the minimum of the \(\chi ^2\) in each bin has been identified. Thus, it is enough that the MCMC algorithm visits each bin at least once, and we omit any empty bins. Concretely, we allow 200 bins for both \(H_0\) and \(\Omega _m\).

In Fig. 6 we show the unnormalised \(R(H_0)\) and \(R(\Omega _m)\) distributions for high redshift SNe with \(z > z_{\text {split}}=1\). The first point to appreciate is that the peaks of the distributions are close to the best fits in Table 1. Note, this provides a consistency check on the best fits, since extremising the \(\chi ^2\) through gradient descent and hopping around parameter space through MCMC marginalisation are independent procedures. This provides a further test of the robustness of least squares fitting in this context (see also appendix). Secondly, as is evident from the dots to the left of the \(R(H_0)\) peak, \(R(H_0)\) goes to zero at both small and large values of \(H_0\) well within our priors. In contrast, as anticipated, the \(R(\Omega _m)\) distribution is almost constant beyond \(\Omega _m \sim 2\), but nevertheless shows a gradual fall off. The fall off towards smaller values of \(\Omega _m\) is considerably sharper. Thirdly, note that the dots essentially follow a curve, but small wobbles are evident in some bins. These features can be ironed out by running a longer MCMC chain. Finally, both \(R(H_0)\) and \(R(\Omega _m)\) confirm that the best fit for \(z > 1\) SNe is not connected to the best fit for the full sample (black lines) through a curve of constant \(\chi ^2\). Thus, we see a degeneracy in (Bayesian) MCMC analysis, but there is no counterpart in a frequentist treatment that involves the \(\chi ^2\). We conclude that it is a misconception in the literature that a degeneracy in MCMC posteriors is equivalent to a constant \(\chi ^2\) curve. We remind the reader again that the \(\chi ^2\) is a measure of how well a point in parameter space fits the data.

We next turn our attention to assessing the statistical significance. The black lines in Fig. 6 denote the best fit values for the full sample from Table 2. Thus, these are the expected values if there is no evolution in the sample. To assess the evolution, we normalise the \(R(H_0)\) distribution by dividing through by the area under the full curve, which is most simply evaluated by numerical integration with Simpson's rule. We then impose a threshold \(\kappa \le 1\) and retain only the \(H_0\) bins with \(R(H_0) > \kappa \). Integrating under the curve for the retained \(H_0\) values and normalising accordingly, one gets a probability p [49]. In Fig. 6 we use dashed, dotted and dashed-dotted lines to denote \(p \in \{0.68, 0.95, 0.997 \}\), corresponding to \(1 \sigma \), \(2 \sigma \) and \(3 \sigma \), respectively, in a Gaussian distribution. Evidently the best fit for the full sample (black line) is removed from the \(H_0\) peak by a statistical significance in the \(95 \%\) to \(99.7\%\) confidence level range. By adjusting the threshold \(\kappa \), one finds the retained area, and hence the probability, at which the curve terminates at the black line. We find that the black line is located at the \(97.2\%\) confidence level, the equivalent of \(2.2 \sigma \) for a Gaussian distribution. This can be directly compared with the \(1.8 \sigma \) from our earlier analysis based on mock simulations. There is a slight difference, but it is worth stressing that two independent techniques agree on a \(\sim 2 \sigma \) discrepancy.
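The threshold procedure can be checked against a case with a known answer: for an exactly Gaussian profile, a point two sigma from the peak should sit at the \(95.45\%\) confidence level. A minimal stdlib sketch, assuming an evenly spaced grid (the Gaussian toy profile is illustrative, not the Pantheon+ \(R(H_0)\)):

```python
import math

def simpson(ys, dx):
    """Composite Simpson's rule over an even number of intervals."""
    return (ys[0] + ys[-1] + 4.0 * sum(ys[1:-1:2])
            + 2.0 * sum(ys[2:-1:2])) * dx / 3.0

def confidence_level(rs, dx, kappa):
    """Fraction of the area under the profile distribution retained by
    keeping only bins with R > kappa."""
    retained = [r if r > kappa else 0.0 for r in rs]
    return simpson(retained, dx) / simpson(rs, dx)

# Gaussian toy profile R(x) = exp(-x^2 / 2) on x in [-6, 6]
dx = 0.01
rs = [math.exp(-0.5 * (-6.0 + dx * i) ** 2) for i in range(1201)]

# threshold kappa set to the R value of a point two sigma from the peak
cl = confidence_level(rs, dx, math.exp(-0.5 * 2.0 ** 2))
print(cl)  # close to 0.9545, i.e. the point sits at roughly 2 sigma
```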

In principle one could repeat the analysis with \(R(\Omega _m)\), but the distribution is broad and has been impacted by our priors. Changing the priors is expected to change the statistical significance of any inference using \(R(\Omega _m)\), so we omit the analysis. If this is unclear, note that restricting the range to \(\Omega _m \in [0, 4]\) would still allow a peak, but the dashed and dotted lines corresponding to \(68\%\) and \(95\%\) of the area under the curve would all shift. The robust take-away is that the peak of the \(R(\Omega _m)\) distribution coincides with negative DE density, \(\Omega _m > 1\). However, there is an important distinction here with MCMC. As we see from Fig. 2, due to a degeneracy in the 2D \((H_0, \Omega _m)\) posterior, changing the \(\Omega _m\) priors can impact the \(H_0\) posterior, whereas with profile distributions the number of times the MCMC algorithm visits a given \(H_0\) bin is unimportant; only the minimum \(\chi ^2\) in the \(H_0\) bin is relevant. This difference means that profile distributions are insensitive to changes in prior: changing the prior either extends or cuts the distribution, but the peak does not move.

6 Discussion

The take-home message is that a decreasing \(H_0\)/increasing \(\Omega _m\) best fit trend observed in the Pantheon SNe sample [41] at low significance \(\sim 1 \sigma \) [24] (see [20,21,22, 26, 27, 37] for the \(H_0\) or \(\Omega _m\) trend alone) persists in the Pantheon+ sample [39, 40] with significance \(\sim 1.4 \sigma \) under similar assumptions that do not focus on a particular \(z_{\text {split}}\). Moreover, calibrated \(z >1\) SNe return \(\Omega _m > 1\) best fits, thereby signaling negative DE densities in the \(\Lambda \)CDM model. Note, this outcome is not overly surprising, because one cannot preclude \(\Omega _m > 1\) best fits at high redshifts even in mock Planck-\(\Lambda \)CDM data; beyond some redshift \(\Omega _m > 1\) best fits become probable. This is a mathematical feature of the \(\Lambda \)CDM model [25, 38]. Using profile distributions [93] (see also [49]), a technique which allows us to correct for projection and/or volume effects in MCMC marginalisation, we have independently confirmed the significance at \(\gtrsim 2 \sigma \). Similar features are evident in the literature, most notably Lyman-\(\alpha \) BAO [50] and QSOs standardised through fluxes in UV and X-ray [29,30,31]. Moreover, recent large SNe samples have led to larger \(\Omega _m\) values that are \(1.5 \sigma \) [95] to \(2 \sigma \) [96] discrepant with Planck [3]. From Fig. 4 of Ref. [96] it is obvious that the sample has a high effective redshift. Note, in contrast to [96], where the high effective redshift is an inherent property of the sample, here we deliberately increase the effective redshift of the Pantheon+ sample by binning it.

To put these results in context we return to the generic solution of the Friedmann equation [16],

$$\begin{aligned} H(z) = H_0 \exp \left( \frac{3}{2} \int _0^{z} \frac{1+w_{\text {eff}}(z^{\prime })}{1+z^{\prime }} \text {d} z^{\prime } \right) , \end{aligned}$$
(15)

where \(w_{\text {eff}}(z)\) is the effective EoS. We observe that the evolution of \(H_0\) (and \(\Omega _m\)) with effective redshift in the Pantheon+ sample is consistent with a disagreement between the assumed EoS, here the \(\Lambda \)CDM model, and the H(z) inferred from Nature. These anomalies are not confined to SNe and we see related features elsewhere [18, 19, 24, 25]. Moreover, JWST is also reporting anomalies that may be cosmological in origin [97,98,99]; JWST anomalies may prefer a phantom DE EoS [99] (however see [100, 101]), which may be a proxy for negative DE densities at higher redshifts. If persistent cosmological tensions [5,6,7,8,9,10,11] are due to systematics, one expects no evolution in \(H_0\) from (15), but this runs contrary to what we are seeing. Our “evolution test”, which may be regarded as a consistency check of the \(\Lambda \)CDM model confronted with data, hence gives a complementary handle for establishing \(\Lambda \)CDM tensions, especially \(H_0\) tension. Note, it is routine to fit data sets in cosmology and simply assume that cosmological parameters are not evolving with effective redshift. Our analysis tests this assumption.
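As a consistency check (a standard derivation, spelled out here for completeness), inserting the \(\Lambda \)CDM effective EoS into (15) recovers (1). For the model (1),

$$\begin{aligned} 1 + w_{\text {eff}}(z) = \frac{\Omega _m (1+z)^3}{1-\Omega _m + \Omega _m (1+z)^3}, \end{aligned}$$

so that the integrand is a total derivative,

$$\begin{aligned} \frac{3}{2} \int _0^{z} \frac{1+w_{\text {eff}}(z^{\prime })}{1+z^{\prime }} \text {d} z^{\prime } = \frac{1}{2} \int _0^{z} \frac{3 \Omega _m (1+z^{\prime })^2}{1-\Omega _m + \Omega _m (1+z^{\prime })^3} \text {d} z^{\prime } = \frac{1}{2} \ln \left[ 1-\Omega _m + \Omega _m (1+z)^3 \right] , \end{aligned}$$

and exponentiating returns \(H(z) = H_0 \sqrt{1-\Omega _m + \Omega _m (1+z)^3}\), i.e. precisely (1).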

Admittedly, this one result may not be enough to falsify \(\Lambda \)CDM. That being said, if evolution is present, as our Bayesian model comparison suggests, this opens the door to finding alternative models that fit the data better than vanilla \(\Lambda \)CDM. Conversely, without any change of \((H_0, \Omega _m)\) with redshift across expansive Type Ia SNe samples, as is the standard assumption in the literature, there is little hope of finding an alternative that beats \(\Lambda \)CDM in Bayesian model comparison. From this perspective, our consistency check feeds into standard Bayesian analysis. However, there is a key difference. Physics demands that models are predictive, i. e. return the same fitting parameters at all epochs, whereas Bayesian methods only assess the goodness of fit and are cruder. Note also that the high redshift subsamples of Pantheon+ we study are small, so they are prone to statistical fluctuations. However, since we see similar trends beyond SNe [25], a statistical fluctuation interpretation is less likely. A second possibility is unexplored systematics in \(z > 1\) SNe identified largely through the Hubble Space Telescope (HST) [102,103,104,105]. There is unquestionable value in flagging these anomalies so that they can be explored. If one can eliminate these two possibilities, the trend must be regarded as corroborating evidence that \(\Lambda \)CDM tensions are physical and the model is breaking down.

Going forward, if the next generation of SNe data [106] increases the statistical significance of the anomaly, as we have seen here in transitioning from Pantheon to Pantheon+, then there are interesting implications. First, any increasing trend in \(\Omega _m\) with effective redshift prevents one from separating \(H_0\) and \(S_8 \propto \sqrt{\Omega _m}\) tensions. This is obvious. Interestingly, sign-switching \(\Lambda \) models, which perform well in alleviating \(H_0\)/\(S_8\) tensions [84], fit well with our main message here, i.e. negative DE at higher redshifts. Secondly, \(\Lambda \)CDM model breakdown allows us to re-evaluate the longstanding observational cosmological constant problem [107]. Thirdly, and most consequentially, it is likely that changes to the DE sector cannot prevent evolution in \(\Omega _m\), because DE is traditionally irrelevant at higher redshifts. Ultimately, if late-time DE does not or cannot come to the rescue [108], this brings the assumption of pressureless matter scaling as \(a^{-3}\) with scale factor a into question in late Universe FLRW cosmology. Finally, if the evolution of \(\Lambda \)CDM parameters discussed here is substantiated in future, it rules out the so-called early resolutions to the \(H_0\)/\(S_8\) tensions [13], such as early dark energy [109].