Optimally adaptive Bayesian spectral density estimation for stationary and nonstationary processes


Abstract

This article improves on existing Bayesian methods for estimating the spectral density of stationary and nonstationary time series under a Gaussian process prior. By optimising an appropriate eigendecomposition using a smoothing spline covariance structure, our method more appropriately models data with both simple and complex periodic structure. We further justify the utility of this optimal eigendecomposition by investigating the performance of covariance functions other than the smoothing spline. We show that the optimal eigendecomposition provides a material improvement, while the other covariance functions under examination do not, all performing comparably to the smoothing spline. During our computational investigation, we introduce new validation metrics for the spectral density estimate, inspired by the physical sciences. We validate our models in an extensive simulation study and demonstrate superior performance with real data.


Acknowledgements

Many thanks to Lamiae Azizi, Sally Cripps and Alex Judge for helpful discussions.

Author information


Corresponding author

Correspondence to Max Menzies.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Percival–Walden AR process

Fig. 5 Analytic and estimated log PSD for the 10-eigenvector and optimal decompositions of the smoothing spline model for the Percival–Walden AR(4), defined in Appendix A. The optimal and penalised optimal smoothing splines coincide, with 23 eigenvectors. Each spectral estimate has only one spectral peak, while the analytic PSD has two, providing an example where the proximity matching criterion of Sect. 3.2 fails

In this brief section, we apply our methodology to a rather challenging autoregressive process that has been highlighted several times in the literature (Box et al. 2015; Percival and Walden 1993) and is commonly known as the Percival–Walden AR(4). This process is defined as \(x_{t} = 2.7607 x_{t-1} - 3.8106 x_{t-2} + 2.6535 x_{t-3} -0.9238 x_{t-4} + \epsilon _{t}\), simulated with length \(n=1024\). As in Sect. 4, we simulate the process and validate our spectral density estimates against the known analytic power spectrum. In this experiment, the optimal and penalised optimal smoothing splines coincide, with 23 eigenvectors. The spectral estimates are plotted in Fig. 5, while validation metrics are provided in Table 4.
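
For illustration, the following is a minimal Python sketch (not part of the original methodology) that simulates this AR(4) process and evaluates its analytic power spectral density; the burn-in length and unit innovation variance are our own illustrative choices, and the PSD normalisation may differ from the convention used elsewhere in the paper.

    import numpy as np

    # Percival-Walden AR(4) coefficients, as defined in Appendix A.
    phi = np.array([2.7607, -3.8106, 2.6535, -0.9238])
    n, burn = 1024, 500                      # burn-in length is an illustrative choice
    rng = np.random.default_rng(0)
    eps = rng.standard_normal(n + burn)      # unit-variance innovations (assumption)

    x = np.zeros(n + burn)
    for t in range(4, n + burn):
        x[t] = (phi[0] * x[t - 1] + phi[1] * x[t - 2]
                + phi[2] * x[t - 3] + phi[3] * x[t - 4] + eps[t])
    x = x[burn:]                             # simulated series of length n = 1024

    # Analytic PSD of an AR(4) process with unit innovation variance:
    # f(nu) = 1 / |1 - sum_k phi_k exp(-2*pi*i*k*nu)|^2, up to normalisation.
    nu = np.arange(n // 2 + 1) / n
    transfer = 1 - sum(phi[k] * np.exp(-2j * np.pi * (k + 1) * nu) for k in range(4))
    analytic_log_psd = -2 * np.log(np.abs(transfer))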

Table 4 Results for synthetic experiments on the Percival–Walden AR(4), defined in Appendix A. The optimal and penalised optimal splines coincide, with 23 eigenvectors. In this instance, the proximity matching criterion from Sect. 3.2 fails, so the distance between sets of peaks in Eq. (24) must be used. We include this distance both between sets of frequencies and amplitudes

This experiment also provides an example where the proximity matching criterion of Sect. 3.2 fails. We can simply observe that the analytic log power spectrum \({\mathbf {g}}\) has two peaks, while the spectral estimates \(\hat{{\mathbf {g}}}\) for both \(\hbox {Spline}_{{10}}\) and the (penalised) optimal smoothing spline each have only one peak. As such, Table 4 includes the values of the semi-metric presented in (24) in Sect. 3.3. We observe that the (penalised) optimal smoothing spline provides a better approximation of the amplitudes of the two peaks than the existing method of Rosen et al.

Appendix B: Discussion of select existing methods

Alongside the statistics community, many signal processing practitioners and engineers have long been interested in the study of time series’ power spectra. Thus, it is worth noting the differences between a framework such as ours and frequentist or signal processing-based methods for power spectral density estimation.

We begin by describing Welch’s method, which is based upon Bartlett’s method. Welch’s method aims to reduce noise in the resulting power spectral density estimate at the cost of frequency resolution. The data are subdivided into overlapping segments, and a modified periodogram is computed within each segment. The modified periodograms are then averaged to produce a final estimate of the power spectral density. There are two model parameters in Welch’s method: the length of each segment and the degree of overlap in data points between adjacent segments.
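
To make these two parameters concrete, the following is a minimal sketch using SciPy's implementation of Welch's method; the example signal, sampling frequency and parameter values are purely illustrative and are not drawn from this paper.

    import numpy as np
    from scipy.signal import welch

    # Illustrative signal: two sinusoids plus white noise.
    fs = 100.0                                    # sampling frequency (Hz)
    t = np.arange(0, 20, 1 / fs)
    rng = np.random.default_rng(0)
    x = (np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)
         + rng.normal(scale=0.3, size=t.size))

    # The two tuning parameters: segment length (nperseg) and overlap (noverlap).
    freqs, psd = welch(x, fs=fs, nperseg=256, noverlap=128)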

Practically, Welch’s method has several limitations in comparison to Bayesian methods. First, the estimated power spectral density produced by methods such as these may be less smooth (though not uniformly so). For scientists hoping to make observations with respect to the maximum amplitude and corresponding frequencies of an underlying time series, the often rough nature of Welch’s estimate may make inference more difficult. It is common in the Bayesian statistics literature to use a flexible prior on the log of the power spectrum, such as a Gaussian process. The smoothness of the Gaussian process may be highly dependent on the covariance structure chosen by the modeller. Applying covariance functions such as the squared exponential and select Matérn family variants allows for smooth interpolation in the resulting power spectral density estimate. Furthermore, recent research has shown that the variance is not a monotonically decreasing function of the fraction of overlap between adjacent segments (Barbe et al. 2010). Second, Welch’s method is unable to algorithmically partition the time series based on changes in the power spectral density. Procedures such as the RJMCMC introduced in this paper identify points in time where the power spectral density has changed. Hence, with Welch’s method, one would be unable to determine locations in the time domain that correspond to changes in the underlying periodic nature of a process.

That said, many practitioners in the signal processing literature use techniques such as wavelets when implementing spectral density estimation in a nonstationary setting. For instance, the continuous wavelet transform has been applied for spectral analysis of nonstationary signals. Wavelets overcome an obvious limitation of Fourier transform-driven methods, where abrupt changes in a time series’ behaviour are difficult to capture (due to the underlying construction as a sum of sinusoidal waves). Unlike sine waves, which smoothly oscillate, wavelets are derived from “step functions” that exist for a finite duration, allowing for the efficient capture of abrupt changes in modelling tasks.

Third, many would argue that a Bayesian framework such as ours provides a more principled approach to uncertainty quantification than frameworks such as Welch’s method. The methodology proposed in this paper quantifies uncertainty surrounding the power spectral density estimate, in addition to uncertainty surrounding the change point locations. One clear advantage of Welch’s method in comparison with the method we have proposed (and other MCMC-based methods), however, is its significantly lower computational cost. While there are certainly frequentist methods to estimate the uncertainty in traditional signal processing estimators, many practitioners prefer the posterior distributions provided by Bayesian methods, if only from a psychological perspective, including the ability to make probabilistic statements about unknown parameters (Wasserman 2004).

Another commonly used framework for spectral density estimation is the multitaper method. Multitaper analysis is an extension of traditional taper analysis, in which time series are tapered before applying a Fourier transformation as a means of reducing potential bias from spectral leakage. The multitaper method averages over a variety of estimators with varying window functions (Thomson 1982; Mann and Lees 1996). This results in a power spectrum that exhibits reduced leakage and variance and retains important information from the initial and final portions of the underlying time series. One major advantage of the multitaper method is that it can be applied in a fairly automatic manner, and it is therefore appropriate in situations where many individual time series must be processed and a thorough analysis of each individual time series is not feasible. One possible limitation of the multitaper method is reduced spectral resolution. The multitaper method has proved to be an effective estimator in the presence of complex spectra. For example, Percival and Walden (1993) highlight the estimator’s effectiveness in detecting the two peaks of their AR(4) process described in Appendix A. As we saw, our methodology was unable to detect the two peaks. Of course, there are many techniques currently in use in addition to Welch’s method and the multitaper method described above. The choice between frequentist and Bayesian methods may depend on the precise problem and even the philosophical outlook of the practitioner. The literature is enriched by a robust continual development of both approaches.
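
To make the multitaper construction described above concrete, the following is a minimal sketch using discrete prolate spheroidal sequence (DPSS, or Slepian) tapers from SciPy; the time-halfbandwidth product NW and number of tapers K are illustrative choices rather than values from this paper.

    import numpy as np
    from scipy.signal.windows import dpss

    def multitaper_psd(x, fs=1.0, NW=4, K=7):
        """Average the periodograms of the series tapered by K DPSS windows."""
        n = len(x)
        tapers = dpss(n, NW, Kmax=K)                      # shape (K, n)
        spectra = np.abs(np.fft.rfft(tapers * x, axis=1)) ** 2 / (fs * n)
        freqs = np.fft.rfftfreq(n, d=1 / fs)
        return freqs, spectra.mean(axis=0)                # simple, unweighted average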

Appendix C: Reversible jump sampling scheme

We follow Rosen et al. (2017, 2012) in our core implementation of the reversible jump sampling scheme. We remark that our method does not improve the trans-dimensional component of the model, described by the reversible jump scheme below. A time series partition is denoted \({\varvec{\xi }}_{m} = (\xi _{0,m},...,\xi _{m,m})\) with m segments. We have a vector of amplitude parameters \({\varvec{\tau }}_{m}^{2} = (\tau _{1,m}^{2},...,\tau _{m,m}^{2})'\) and regression coefficients \({\varvec{\beta }}_{m} = ({\varvec{\beta '}}_{1,m},...,{\varvec{\beta '}}_{m,m})\) that we wish to estimate, for the jth component within a partition of m segments, \(j=1,...,m.\) For notational simplicity, \({\varvec{\beta }}_{j,m}, j=1,...,m,\) is assumed to include the first entry, \(\alpha _{0j,m}.\) In what follows, superscripts c and p refer to the current and proposed values in the sampling scheme.

First, we describe the between-model moves: let \({\varvec{\theta }}_{m} = ({\varvec{\xi }}'_{m}, {\varvec{\tau }}^{2'}_{m}, {\varvec{\beta '}}_{m})\) be the model parameters at some point in the sampling scheme and assume that the chain starts at \((m^c, {\varvec{\theta }}_{m^c}^{c})\). The algorithm proposes the move to \((m^p, {\varvec{\theta }}_{m^p}^p)\), by drawing \((m^p, {\varvec{\theta }}_{m^p}^p)\) from the proposal distribution

\(q(m^p, {\varvec{\theta }}_{m^p}^p|m^c, {\varvec{\theta }}_{m^c}^c)\). That draw is accepted with probability

$$\begin{aligned} \alpha = \text { min } \Bigg \{1, \frac{p(m^{p}, {\varvec{\theta }}_{m^p}^{p}|{\varvec{x}}) q(m^c, {\varvec{\theta }}_{m^c}^{c}|m^p, {\varvec{\theta }}_{m^p}^{p})}{p(m^{c}, {\varvec{\theta }}_{m^c}^{c}|{\varvec{x}}) q(m^p, {\varvec{\theta }}_{m^p}^{p}|m^c, {\varvec{\theta }}_{m^c}^{c})} \Bigg \}, \end{aligned}$$

with \(p(\cdot )\) referring to a target distribution, the product of the likelihood and the prior. The target and proposal distributions will vary based on the type of move taken in the sampling scheme. First, \(q(m^p, {\varvec{\theta }}_{m^{p}}^{p}| m^c, {\varvec{\theta }}_{m^{c}}^{c})\) is described as follows:

$$\begin{aligned}&q(m^p, {\varvec{\theta }}_{m^p}^{p}|m^c, {\varvec{\theta }}_{m^{c}}^{c}) = q(m^p|m^c) q({\varvec{\theta }}_{m^p}^{p}| m^p, m^c, {\varvec{\theta }}_{m^c}^{c}) \\&\quad = q(m^p|m^c) q({\varvec{\xi ^p_{m^p}}}, {\varvec{\tau }}_{m^p}^{2p}, {\varvec{\beta }}_{m^p}^{p} | m^p, m^c, {\varvec{\theta }}_{m^c}^c) \\&\quad = q(m^p|m^c) q({\varvec{\xi }}_{m^p}^p|m^p, m^c, {\varvec{\theta }}_{m^c}^c) q({\varvec{\tau }}_{m^p}^{2p}|{\varvec{\xi }}_{m^p}^p, m^p, m^c, {\varvec{\theta }}_{m^c}^c) \\&\quad \times q({\varvec{\beta }}_{m^p}^{p}|{\varvec{\tau }}_{m^p}^{2p}, {\varvec{\xi }}_{m^p}^p, m^p, m^c, {\varvec{\theta }}_{m^c}^c). \end{aligned}$$

To draw \((m^p, {\varvec{\theta }}_{m^p}^p)\) one must first draw \(m^p\), followed by \({\varvec{\xi }}_{m^p}^p\), \({\varvec{\tau }}_{m^p}^{2p}, \text { and } {\varvec{\beta }}_{m^p}^{p}\). First, the number of segments \(m^p\) is drawn from the proposal distribution \(q(m^p|m^c)\). Let M be the maximum number of segments and \(m^{c}_{2,\text {min}}\) be the number of current segments containing at least \(2 t_{\text {min}}\) data points. The proposal is as follows:

$$\begin{aligned}&q(m^p = k | m^c) \\&\quad =\left\{ \begin{array}{ll} 1/2 \text { if } k = m^c - 1 \text { or } k = m^c + 1, \text { when } m^{c} \ne 1, m^{c} \ne M \text { and } m_{2,\text {min}}^c \ne 0\\ 1 \text { if } k = m^{c}-1, \text { when } m^{c} = M \text { or } m_{2,\text {min}}^{c} = 0 \\ 1 \text { if } k = m^{c} + 1, \text { when } m^{c} = 1 \end{array} \right. \end{aligned}$$
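
The following is a minimal sketch of this proposal for the number of segments; the names m_c (current number of segments), M_max (maximum number of segments) and m2_min (number of current segments with at least \(2t_{\text {min}}\) data points) are our own labels for the quantities defined above.

    import numpy as np

    def propose_num_segments(m_c, M_max, m2_min, rng=np.random.default_rng()):
        """Draw m^p from q(m^p | m^c): a birth, a death, or a fair coin between them."""
        can_birth = (m_c < M_max) and (m2_min > 0)   # need room and a splittable segment
        can_death = (m_c > 1)                        # need at least two segments to merge
        if can_birth and can_death:
            return m_c + 1 if rng.random() < 0.5 else m_c - 1
        if can_death:
            return m_c - 1                           # forced death move
        return m_c + 1                               # forced birth move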

Conditional on the proposed model \(m^p\), a new partition \({\varvec{\xi }}_{m^p}^{p}\), a new vector of covariance amplitude parameters \({\varvec{\tau }}_{m^p}^{2p}\) and a new vector of regression coefficients, \({\varvec{\beta }}_{m^p}^{p}\) are proposed. In Rosen et al. (2012), \(\tau ^{2}\) is referred to as a smoothing parameter. To impact the smoothness of the covariance function, the parameter would have to impact pairwise operations. Given that \(\tau ^{2}\) sits outside the covariance matrix, we will refer to \(\tau ^{2}\) as an amplitude parameter (akin to signal variance within the Gaussian process framework Rasmussen and Williams 2005).

Now, we describe the process of the birth of new segments. Suppose that \(m^p = m^c + 1\). A time series partition,

$$\begin{aligned} {\varvec{\xi }}^{p}_{m^p} = (\xi ^c_{0,m^c},...,\xi ^c_{k^{*}-1,m^c},\xi _{k^{*},m^{p}}^{p}, \xi _{k^{*},m^{c}}^{c},...,\xi _{m^c,m^c}^{c}) \end{aligned}$$

is drawn from the proposal distribution \(q({\varvec{\xi }}_{m^p}^{p}|m^p, m^c, {\varvec{\theta }}_{m^c}^c)\). The algorithm proposes a partition by first selecting a random segment \(j = k^{*}\) to split. Then, a point \(t^{*}\) within the segment \(j=k^{*}\) is randomly selected to be the proposed partition point. This is subject to the constraint,

\(\xi _{k^{*}-1, m^c}^{c} + t_{\text {min}} \le t^{*} \le \xi _{k^{*},m^c}^c - t_{\text {min}}\). The proposal distribution is computed as follows:

$$\begin{aligned} q(\xi _{k^{*},m^p}^{p}&= t^{*} | m^p, m^c, {\varvec{\xi }}_{m^c}^{c}) = p(j=k^{*} | m^p, m^c, {\varvec{\xi }}_{m^c}^{c})\, p(\xi _{k^{*}, m^p}^{p} = t^{*} | j=k^{*}, m^p, m^c, {\varvec{\xi }}_{m^c}^c) \\&= \frac{1}{m_{2,\text {min}}^{c}(n_{k^{*}, m^c} - 2t_{\text {min}}+1)}. \end{aligned}$$

The vector of amplitude parameters

$$\begin{aligned}&\tau _{m^p}^{2p} = (\tau _{1,m^c}^{2c},...,\tau _{k^{*}-1,m^c}^{2c}, \\&\tau _{k^{*},m^p}^{2p}, \tau _{k^{*}+1,m^p}^{2p}, \tau _{k^{*}+1,m^c}^{2c},...,\tau _{m^c, m^c}^{2c}) \end{aligned}$$

is drawn from the proposal distribution

\(q({\varvec{\tau }}_{m^p}^{2p}|m^p, {\varvec{\xi }}_{m^p}^p, m^c, {\varvec{\theta }}_{m^c}^c) = q({\varvec{\tau }}_{m^p}^{2p}|m^p, {\varvec{\tau }}_{m^c}^{2c}).\) The algorithm is based on the reversible jump algorithm of Green (1995). It draws from a uniform distribution \(u \sim U[0,1]\) and defines \(\tau _{k^{*}, m^p}^{2p}\) and \(\tau _{k^{*}+1, m^p}^{2p}\) in terms of u and \(\tau _{k^{*}, m^c}^{2c}\) as follows:

$$\begin{aligned} \tau _{k^{*}, m^p}^{2p}&= \frac{u}{1-u}\tau _{k^{*}, m^c}^{2c}; \end{aligned}$$
(27)
$$\begin{aligned} \tau _{k^{*}+1, m^p}^{2p}&= \frac{1-u}{u}\tau _{k^{*}, m^c}^{2c}. \end{aligned}$$
(28)
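
The following is a minimal sketch of this amplitude-parameter split, Eqs. (27) and (28), together with the geometric-mean merge used later in the death move; the quick check at the end confirms that the merge inverts the split.

    import numpy as np

    def split_tau2(tau2, u):
        """Birth move: split one amplitude parameter into two, Eqs. (27)-(28)."""
        return (u / (1 - u)) * tau2, ((1 - u) / u) * tau2

    def merge_tau2(tau2_left, tau2_right):
        """Death move: merge two amplitude parameters via their geometric mean."""
        return np.sqrt(tau2_left * tau2_right)

    u = np.random.default_rng(1).uniform()
    t1, t2 = split_tau2(3.0, u)
    assert np.isclose(merge_tau2(t1, t2), 3.0)    # the merge recovers the original tau^2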

The vector of coefficients

$$\begin{aligned} {\varvec{\beta }}_{m^p}^p&= ({\varvec{\beta }}_{1,m^c}^{c},...,{\varvec{\beta }}_{k^{*}-1,m^c}^{c}, \\&{\varvec{\beta }}_{k^{*}, m^p}^p, {\varvec{\beta }}_{k^{*}+1,m^p}^{p}, {\varvec{\beta }}_{k^{*}+1,m^c}^{c},...,{\varvec{\beta }}_{m^c,m^c}^{c}) \end{aligned}$$

is drawn from the proposal distribution

\(q({\varvec{\beta }}_{m^p}^p|{\varvec{\tau }}_{m^p}^{2p},{\varvec{\xi }}_{m^p}^{p},m^p, m^c, {\varvec{\theta }}_{m^c}^c) = q({\varvec{\beta }}_{m^p}^{p}|{\varvec{\tau }}_{m^p}^{2p}, {\varvec{\xi }}_{m^p}^p, m^p)\). The pair of vectors \({\varvec{\beta }}_{k^{*}, m^p}^p\) and \({\varvec{\beta }}_{k^{*}+1, m^p}^p\) are drawn from Gaussian approximations to the respective posterior conditional distributions \(p({\varvec{\beta }}_{k^{*}, m^p}^p|{\varvec{x}}_{k^{*}}^p, \tau _{k^{*}, m^p}^{2p}, m^p)\) and

\(p({\varvec{\beta }}_{k^{*}+1, m^p}^{p}|{\varvec{x}}_{k^{*}+1}^p, \tau _{k^{*}+1, m^p}^{2p}, m^p)\), respectively. Here, \({\varvec{x}}_{k^{*}}^p\) and \({\varvec{x}}_{k^{*}+1}^p\) refer to the subsets of the time series with respective segments \(k^{*}\) and \(k^{*}+1\). \({\varvec{\xi }}_{m^p}^p\) will determine \({\varvec{x_{*}}}^p = ({\varvec{x}}_{k^{*}}^{p'}, {\varvec{x}}_{k^{*}+1}^{p'})'\). For the sake of exposition, we provide the following example: the coefficient \({\varvec{\beta }}_{k^{*}, m^p}^{p}\) is drawn from the Gaussian distribution \(N({\varvec{\beta }}_{k^{*}}^{\text {max}}, \Sigma _{k^{*}}^{\text {max}})\), where \({\varvec{\beta }}_{k^{*}}^{\text {max}}\) is defined as

$$\begin{aligned} {{\,\mathrm{argmax}\,}}_{{\varvec{\beta }}_{k^{*}, m^p}^{p}} p({\varvec{\beta }}^p_{k^{*}, m^p}|{\varvec{x}}_{k^{*}}^p, \tau _{k^{*}, m^p}^{2p}, m^p) \end{aligned}$$

and

$$\begin{aligned} \Sigma _{k^{*}}^{\text {max}} =-\Bigg \{ \frac{\partial ^{2}{\log p({\varvec{\beta }}_{k^{*}, m^p}^p | {\varvec{x}}_{k^{*}}^p, \tau _{k^{*}, m^p}^{2p}, m^p)}}{\partial {\varvec{\beta }}_{k^{*}, m^p}^{p}\,\partial {\varvec{\beta }}_{k^{*}, m^p}^{p'}} \Bigg |_{{\varvec{\beta }}_{k^{*}, m^p}^{p} = {\varvec{\beta }}_{k^{*}}^{\text {max}}} \Bigg \}^{-1}. \end{aligned}$$

For the birth move, the probability of acceptance is \(\alpha = \min \{1,A\}\), where A is equal to

$$\begin{aligned}&\Bigg | \frac{\partial (\tau _{k^{*}, m^p}^{2p}, \tau _{k^{*}+1, m^p}^{2p})}{\partial (\tau _{k^{*}, m^c}^{2c}, u)} \Bigg |\frac{p({\varvec{\theta }}_{m^p}^p|{\varvec{x}}, m^p) p({\varvec{\theta }}_{m^p}^p|m^p)p(m^p)}{p({\varvec{\theta }}_{m^c}^c|{\varvec{x}}, m^c) p({\varvec{\theta }}_{m^c}^c|m^c)p(m^c)} \\&\quad \times \frac{p(m^{c}|m^p)p({\varvec{\beta }}_{k^{*}, m^c}^{c})}{p(m^p|m^c)p(\xi _{k^{*}, m^p}^{p}|m^p, m^c) p(u) p({\varvec{\beta }}^p_{k^{*}, m^p})p({\varvec{\beta }}_{k^{*}+1,m^{p}}^{p})}. \end{aligned}$$

Above, \(p(u) = 1, 0 \le u \le 1,\) while \(p({\varvec{\beta }}_{k^{*}, m^p}^{p})\) and \(p({\varvec{\beta }}_{k^{*}+1, m^p}^{p})\) are Gaussian proposal distributions \(N({\varvec{\beta }}_{k^{*}}^{\text {max}}, \Sigma _{k^{*}}^{\text {max}})\) and

\(N({\varvec{\beta }}_{k^{*}+1}^{\text {max}}, \Sigma _{k^{*}+1}^{\text {max}})\), respectively. The Jacobian is computed as

$$\begin{aligned}&\bigg | \frac{\partial (\tau _{k^{*}, m^p}^{2p}, \tau _{k^{*}+1, m^p}^{2p})}{\partial (\tau ^{2c}_{k^{*}, m^c}, u)} \bigg |\\&\quad = \frac{2 \tau _{k^{*},m^c}^{2c}}{u(1-u)} = 2(\tau _{k^{*}, m^p}^{p} + \tau _{k^{*}+1, m^p}^{p})^{2}. \end{aligned}$$

Next, we describe the death of a segment, that is, the reverse of a birth move, where \(m^p = m^c - 1\). A time series partition

$$\begin{aligned} {\varvec{\xi }}_{m^p}^{p} = (\xi _{0,m^c}^{c},...,\xi _{k^{*}-1,m^c}^{c}, \xi _{k^{*}+1,m^c}^{c},...,\xi _{m^c,m^c}^{c}), \end{aligned}$$

is proposed by randomly selecting a single partition point from \(m^c - 1\) candidates and removing it. The partition point selected for removal is denoted \(j=k^{*}\). There are \(m^c -1\) interior partition points available for removal among the \(m^c\) segments currently in existence. The proposal may choose each partition point with equal probability, that is,

$$\begin{aligned} q(\xi _{j, m^p}^p|m^p, m^c, {\varvec{\xi }}_{m^c}^c) = \frac{1}{m^c - 1}. \end{aligned}$$

The vector of amplitude parameters

$$\begin{aligned} {\varvec{\tau }}_{m^p}^{2p} = (\tau _{1, m^c}^{2c},...,\tau _{k^{*}-1,m^c}^{2c},\tau _{k^{*},m^p}^{2p}, \tau _{k^{*}+2,m^c}^{2c},...,\tau _{m^c,m^c}^{2c}) \end{aligned}$$

is drawn from the proposal distribution

\(q({\varvec{\tau }}_{m^p}^{2p}|m^p, {\varvec{\xi }}_{m^p}^p, m^c, {\varvec{\theta }}_{m^c}^{c}) = q({\varvec{\tau }}_{m^p}^{2p}| m^p, {\varvec{\tau }}_{m^c}^{2c})\). One amplitude parameter \(\tau _{k^{*}, m^p}^{2p}\) is formed from two candidate amplitude parameters, \(\tau _{k^{*},m^c}^{2c}\) and \(\tau _{k^{*}+1,m^c}^{2c}\). This is done by reversing Eqs. (27) and (28). That is,

$$\begin{aligned} \tau _{k^{*}, m^p}^{2p} = \sqrt{\tau _{k^{*}, m^{c}}^{2c} \tau _{k^{*}+1, m^c}^{2c}}. \end{aligned}$$

Finally, the vector of regression coefficients,

$$\begin{aligned} {\varvec{\beta }}_{m^p}^{p} = (\beta _{1,m^c}^{c},...,\beta _{k^{*}-1,m^c}^{c}, \beta _{k^{*}, m^p}^{p}, \beta _{k^{*}+2,m^c}^{c},...,\beta _{m^c,m^c}^{c}) \end{aligned}$$

is drawn from the proposal distribution

\(q({\varvec{\beta }}_{m^p}^p|{\varvec{\tau }}_{m^p}^{2p}, {\varvec{\xi }}_{m^p}^p, m^p, m^c, \theta _{m^c}^c) = q({\varvec{\beta }}_{m^p}^{p}|{\varvec{\tau }}_{m^p}^{2p}, {\varvec{\xi }}_{m^p}^p, m^p)\). The vector of regression coefficients is drawn from a Gaussian approximation to the posterior distribution

\(p(\beta _{k^{*},m^p}|{\varvec{x}}, \tau _{k^{*}, m^p}^{2p}, {\varvec{\xi }}^p_{m^p}, m^p)\), following the same procedure as for the vector of coefficients in the birth step. The probability of acceptance is the reciprocal of that of the analogous birth step. If the move is accepted, the following updates occur: \(m^c=m^p\) and \({\varvec{\theta }}_{m^c}^c = {\varvec{\theta }}_{m^p}^{p}\).

Finally, we describe the within-model moves: henceforth, m is fixed; accordingly, notation describing the dependence on the number of segments is removed. There are two parts to a within-model move. First, a segment relocation is performed, and conditional on the relocation, the basis function coefficients are updated. The steps are jointly accepted or rejected with a Metropolis–Hastings step. The amplitude parameters are updated within a separate Gibbs sampling step.

The chain is assumed to be located at \({\varvec{\theta }}^{c} = ({\varvec{\xi }}^{c}, {\varvec{\beta }}^{c})\). The proposed move \({\varvec{\theta }}^p = ({\varvec{\xi }}^p, {\varvec{\beta }}^p)\) is as follows: first, a partition point \(\xi _{k^{*}}\) is selected for relocation from \(m-1\) candidate partition points. Next, a position within the interval \([\xi _{k^{*}-1}, \xi _{k^{*}+1}]\) is selected, subject to the constraint that the new location is at least \(t_{\text {min}}\) data points away from \(\xi _{k^{*}-1}\) and \(\xi _{k^{*}+1}\), so that

$$\begin{aligned} \Pr (\xi ^p_{k^{*}}=t) = \Pr (j=k^{*}) \Pr (\xi _{k^{*}}^{p}=t|j=k^{*}), \end{aligned}$$

where \(\Pr (j=k^{*}) = (m-1)^{-1}\). A mixture distribution for \(\Pr (\xi _{k^{*}}^p=t|j=k^{*})\) is constructed to explore the space most efficiently, so

$$\begin{aligned}&\Pr (\xi _{k^{*}}^{p}=t|j=k^{*}) = \\&\pi q_1 (\xi _{k^{*}}^p = t| \xi _{k^{*}}^{c}) + (1-\pi ) q_2 (\xi _{k^{*}}^p=t|\xi _{k^{*}}^c), \end{aligned}$$

where \(q_1(\xi _{k^{*}}^p = t| \xi _{k^{*}}^c) = (n_{k^{*}} + n_{k^{*}+1}-2t_{\text {min}} + 1)^{-1}, \xi _{k^{*}-1} + t_{\text {min}} \le t \le \xi _{k^{*}+1} - t_{\text {min}}\) and

$$\begin{aligned}&q_2(\xi _{k^{*}}^p = t|\xi _{k^{*}}^{c}) \\&\quad =\left\{ \begin{array}{ll} 0 \text { if } |t-\xi ^c_{k^{*}}| > 1 \\ 1/3 \text { if } |t-\xi ^c_{k^{*}}| \le 1, n_{k^{*}} \ne t_{\text {min}} \text { and } n_{k^{*}+1} \ne t_{\text {min}} \\ 1/2 \text { if } 0 \le t-\xi _{k^{*}}^{c} \le 1, n_{k^{*}} = t_{\text {min}} \text { and } n_{k^{*}+1} \ne t_{\text {min}} \\ 1/2 \text { if } 0 \le \xi _{k^{*}}^{c} - t \le 1, n_{k^{*}} \ne t_{\text {min}} \text { and } n_{k^{*}+1} = t_{\text {min}} \\ 1 \text { if } t = \xi _{k^{*}}^{c}, n_{k^{*}} = t_{\text {min}} \text { and } n_{k^{*}+1} = t_{\text {min}} \end{array} \right. \end{aligned}$$

The support of \(q_1\) has \(n_{k^{*}} + n_{k^{*}+1} - 2t_{\text {min}} + 1\) data points, while that of \(q_2\) has at most three. The term \(q_2\) alone would result in a high acceptance rate for the Metropolis–Hastings step, but it would explore the parameter space slowly. The \(q_1\) component allows for larger jumps, producing a compromise between a high acceptance rate and thorough exploration of the parameter space.
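
The following is a minimal sketch of this mixture proposal, with hypothetical inputs: xi_prev, xi_cur and xi_next denote the neighbouring and current partition points, and pi_mix denotes the mixture weight \(\pi \); all names and default values are illustrative.

    import numpy as np

    def propose_relocation(xi_prev, xi_cur, xi_next, t_min, pi_mix=0.2,
                           rng=np.random.default_rng()):
        """Draw a new location for the partition point between xi_prev and xi_next."""
        lo, hi = xi_prev + t_min, xi_next - t_min     # admissible range of locations
        if rng.random() < pi_mix:
            # q1: uniform over all admissible locations (allows large jumps).
            return int(rng.integers(lo, hi + 1))
        # q2: local move to an admissible point within one step of the current location.
        local = [t for t in (xi_cur - 1, xi_cur, xi_cur + 1) if lo <= t <= hi]
        return int(rng.choice(local))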

Next, \({\varvec{\beta ^{p}_{j}}}, j=k^{*}, k^{*}+1\) is drawn from an approximation to \(\prod ^{k^{*}+1}_{j=k^{*}} p({\varvec{\beta }}_j|{\varvec{x}}_j^p, \tau _j^{2})\), following the analogous step in the between-model move. The proposal distribution, which is evaluated at \({\varvec{\beta }}^{p}_j, j=k^{*}, k^{*}+1\), is

$$\begin{aligned} q({\varvec{\beta }}_{*}^{p}|{\varvec{x}}_{*}^p, {\varvec{\tau }}_{*}^{2}) = \prod ^{k^{*}+1}_{j=k^{*}} q({\varvec{\beta }}_{j}^p|{\varvec{x}}_j^p, \tau _j^{2}), \end{aligned}$$

where \({\varvec{\beta }}_{*}^p = ({\varvec{\beta }}^{p'}_{k^{*}}, {\varvec{\beta }}^{p'}_{k^{*}+1})'\) and \({\varvec{\tau }}_{*}^{2} = (\tau ^{2}_{k^{*}}, \tau ^{2}_{k^{*}+1})'\). The proposal distribution is evaluated at current values of \({\varvec{\beta }}_{*}^{c} = (\beta ^{c'}_{k^{*}}, \beta ^{c'}_{k^{*}+1})'\). \(\beta _{*}^p\) is accepted with probability

$$\begin{aligned} \alpha = \min \Bigg \{ 1, \frac{p({\varvec{x}}_{*}^p|{\varvec{\beta }}_{*}^{p}) p({\varvec{\beta }}_{*}^{p}|{\varvec{\tau }}_{*}^{2}) q({\varvec{\beta }}_{*}^{c}|{\varvec{x}}^{c}_{*}, {\varvec{\tau }}_{*}^{2})}{p({\varvec{x}}_{*}^c|{\varvec{\beta }}_{*}^{c}) p({\varvec{\beta }}_{*}^{c}|{\varvec{\tau }}_{*}^{2}) q({\varvec{\beta }}_{*}^{p}|{\varvec{x}}^{p}_{*}, {\varvec{\tau }}_{*}^{2})} \Bigg \}, \end{aligned}$$

where \({\varvec{x}}_{*}^{c} = ({\varvec{x}}^{c'}_{k^{*}},{\varvec{x}}^{c'}_{k^{*}+1})'\). When the draw is accepted, update the partition and regression coefficients \((\xi ^{c}_{k^{*}}, \beta _{*}^{c}) = (\xi ^{p}_{k^{*}}, \beta _{*}^{p})\). Finally, draw \({\varvec{\tau }}_{*}^{2}\) from

$$\begin{aligned} p(\tau _{*}^{2}|{\varvec{\beta }}_{*}) = \prod ^{k^{*}+1}_{j=k^{*}} p(\tau _j^{2}|\beta _j). \end{aligned}$$

This is a Gibbs sampling step, and accordingly the draw is accepted with probability 1.

Appendix D: Metropolis–Hastings algorithm

In this section, we describe the Metropolis–Hastings algorithm used in the stationary case for our simulation study (Sect. 4). As seen in Appendix C, the above RJMCMC reduces to a Metropolis–Hastings algorithm in the absence of between-model moves.

We estimate the log of the spectral density by its posterior mean via a Bayesian approach and an adaptive MCMC algorithm:

$$\begin{aligned} {{\,\mathrm{{\mathbb {E}}}\,}}({\mathbf {g}}|{\mathbf {y}}) = \int {{\,\mathrm{{\mathbb {E}}}\,}}({\mathbf {g}}|{\mathbf {y}}, \mathbf {\theta }) p(\mathbf {\theta }|{\mathbf {y}}) \mathrm{d}\theta \,\, \simeq \,\, \frac{1}{M} \sum _{j=1}^{M} {{\,\mathrm{{\mathbb {E}}}\,}}(\hat{{\mathbf {g}}}|{\mathbf {y}}, \mathbf {\theta }^{j}). \end{aligned}$$
(29)

Here, M is the number of post-burn-in iterations in the MCMC scheme; \(\mathbf {\theta }^{j}\) are samples taken from the posterior distribution \(p(\mathbf {\theta }|{\mathbf {y}})\); \(p(\mathbf {\theta })\) is a Gaussian distribution \(N(\mu , \sigma ^{2})\) centred at \(\mu =\theta ^{[c]}\), the value that maximises the log marginal likelihood; \(\sigma \) is chosen arbitrarily; and \(\hat{{\mathbf {g}}}\) is the predicted log spectrum.

Monte Carlo algorithms have been highly prominent for estimation of hyperparameters and spectral density in a nonparametric Bayesian framework. Metropolis et al. (1953) first proposed the Metropolis algorithm; this was generalised by Hastings in a more focused, statistical context (Hastings 1970). The random walk Metropolis–Hastings algorithm aims to sample from a target density \(\pi \), given some candidate transition probability function q(xy). In our context, \(\pi \) represents the Whittle likelihood function multiplied by respective priors. The acceptance ratio is:

$$\begin{aligned} \alpha (x,y) = {\left\{ \begin{array}{ll} \min \bigg ( \frac{\pi (y)q(y,x)}{\pi (x) q(x,y)}, 1 \bigg ) \text { if } \pi (x) q(x,y) > 0 \\ 1 \text { if } \pi (x) q(x,y) = 0. \end{array}\right. } \end{aligned}$$
(30)

Our MCMC scheme calculates the acceptance ratio every 50 iterations; based on an ideal acceptance ratio (Roberts and Rosenthal 2004), the step size is adjusted for each hyperparameter. The power spectrum and hyperparameters for the GP covariance function are calculated in each iteration of our sampling scheme and integrated over.

First, we initialise the values of our GP hyperparameters, \(\theta ^{[c]}\), our log PSD \({\hat{g}}^{c}\), our random perturbation size \(s^{2}\), and the adaptive adjustment for our step size \(\xi \). Starting values for our GP hyperparameters are chosen based on maximising the marginal Whittle likelihood,

$$\begin{aligned} \theta ^{[c]}= & {} {{\,\mathrm{argmax}\,}}_{\theta } (2 \pi )^{-m/2} \prod _{j=0}^{m-1}\nonumber \\&\exp \left( {-\frac{1}{2}\left[ \log f(\nu _{j}) + \frac{I(\nu _{j})}{f(\nu _{j})}\right] }\right) . \end{aligned}$$
(31)
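
The following is a minimal sketch of the log of the Whittle likelihood in Eq. (31), the objective maximised to initialise the hyperparameters; spectral_density is a placeholder for a parametric spectral density \(f_{\theta }(\nu )\) evaluated at the Fourier frequencies.

    import numpy as np

    def log_whittle_likelihood(periodogram, spectral_density, theta):
        """Log Whittle likelihood: -(m/2) log(2*pi) - (1/2) sum[log f + I/f]."""
        f = spectral_density(theta)                  # f(nu_j; theta), length m
        m = len(periodogram)
        return (-0.5 * m * np.log(2 * np.pi)
                - 0.5 * np.sum(np.log(f) + periodogram / f))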

The latent log PSD is modelled with a zero-mean GP. That is, \({\mathbf {g}} \sim GP(0,k_{\theta }(x,x'))\), and we follow the notation of Rasmussen and Williams (2005), where \(k(x,x')\) refers to a generic covariance kernel. The adaptive MCMC algorithm samples from the posterior distribution of the log PSD and the posterior distribution of any candidate covariance function’s hyperparameters. First, the current and proposed values for the mean and covariance of the Gaussian process are computed. That is, the mean is computed:

$$\begin{aligned} \hat{g^{c}} = k_{\theta ^c}(x',x)[k_{\theta ^c}(x,x) + \sigma ^{2}I]^{-1}\log I(\mathbf {\nu }), \end{aligned}$$
(32)

and the covariance is computed:

$$\begin{aligned} \hat{V^{c}} = k_{\theta ^c}(x',x')-k_{\theta ^c}(x',x)[k_{\theta ^c}(x,x) + \sigma ^{2}I]^{-1}k_{\theta ^c}(x,x').\nonumber \\ \end{aligned}$$
(33)
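
The following is a minimal sketch of the conditioning step in Eqs. (32) and (33), evaluated at the training frequencies (so that \(x' = x\)) and using a squared-exponential kernel as one illustrative choice of covariance function; nu denotes the Fourier frequencies and log_I the log periodogram.

    import numpy as np

    def sq_exp_kernel(a, b, lengthscale, signal_var):
        """Squared-exponential kernel; one illustrative choice of k_theta."""
        d2 = (a[:, None] - b[None, :]) ** 2
        return signal_var * np.exp(-0.5 * d2 / lengthscale ** 2)

    def gp_posterior(nu, log_I, lengthscale, signal_var, noise_var):
        """Posterior mean and covariance of the latent log PSD, Eqs. (32)-(33)."""
        K = sq_exp_kernel(nu, nu, lengthscale, signal_var)
        K_inv = np.linalg.inv(K + noise_var * np.eye(len(nu)))
        mean = K @ K_inv @ log_I        # Eq. (32) with x' = x
        cov = K - K @ K_inv @ K         # Eq. (33) with x' = x
        return mean, cov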

New proposals for GP hyperparameters are determined via a random walk proposal distribution. A zero-mean Gaussian distribution is used to generate candidate perturbations, where \(s^2\) is the variance of this Gaussian. That is

$$\begin{aligned} \theta ^{p} \xleftarrow []{} q(\theta ^p|\theta ^c). \end{aligned}$$
(34)

Having drawn the proposed GP hyperparameters, a proposed mean and covariance function of the log PSD are drawn from the posterior distribution of the GP. Both the proposed mean and covariance are computed in the same way as the current values, simply replacing the hyperparameters \(\theta ^{c}\) with \(\theta ^{p}\). So, the proposed mean of the log PSD is,

$$\begin{aligned} \hat{g^{p}} = k_{\theta ^p}(x',x)[k_{\theta ^p}(x,x) + \sigma ^{2}I]^{-1}\log I(\mathbf {\nu }) \end{aligned}$$
(35)

and the proposed covariance is computed as follows

$$\begin{aligned} \hat{V^{p}} = k_{\theta ^p}(x',x')-k_{\theta ^p}(x',x)[k_{\theta ^p}(x,x) + \sigma ^{2}I]^{-1}k_{\theta ^p}(x,x').\nonumber \\ \end{aligned}$$
(36)

Having computed the proposed and current values of the log PSD, we update the current log PSD based on the Metropolis–Hastings transition kernel. First, we sample from a uniform distribution \(u \sim U(0,1)\) and compute our acceptance ratio,

$$\begin{aligned} \alpha = \min \left( 1, \frac{p(\log I(\mathbf {\nu }) \mid \theta ^{p}, {\hat{g}}^{p}) p({\hat{g}}^{p}) q({\hat{g}}^{c} \mid \log I(\mathbf {\nu }), \theta ^{p}))}{p(\log I(\mathbf {\nu }) \mid \theta ^{c}, {\hat{g}}^{c}) p({\hat{g}}^{c}) q({\hat{g}}^{p} \mid \log I(\mathbf {\nu }), \theta ^{c}))}\right) .\nonumber \\ \end{aligned}$$
(37)

\(p(\log I(\mathbf {\nu }) \mid \theta ^{p}, {\hat{g}}^{p})\) is our Whittle likelihood computation, the probability of the log periodogram conditional on hyperparameters \(\theta \) and our candidate estimate of the latent log PSD \({\hat{g}}\). \(p({\hat{g}}^{p})\) represents the prior distribution on our latent log PSD, and \(q({\hat{g}}^{c} \mid \log I(\mathbf {\nu }), \theta ^{p})\) is our proposal distribution, representing the probability of the estimated log PSD conditional on the log periodogram and GP hyperparameters.

Should \(u < \alpha \), we update the current values of the log PSD mean and covariance to the proposed values. That is,

$$\begin{aligned}&{\hat{g}}^{c+1} \xleftarrow []{} {\hat{g}}^{p} \end{aligned}$$
(38)
$$\begin{aligned}&{\hat{V}}^{c+1} \xleftarrow []{} {\hat{V}}^{p}. \end{aligned}$$
(39)

If \(u \ge \alpha \), both the mean and covariance of the log PSD are kept at their current values,

$$\begin{aligned}&{\hat{g}}^{c+1} \xleftarrow []{} {\hat{g}}^{c} \end{aligned}$$
(40)
$$\begin{aligned}&{\hat{V}}^{c+1} \xleftarrow []{} {\hat{V}}^{c}. \end{aligned}$$
(41)

Importantly, modelling the log PSD with a GP prior does not mean that we are assuming a Gaussian error distribution around the spectrum. In actuality, proposed spectra are accepted and rejected through a Metropolis–Hastings procedure, so log PSD samples are drawn from the true posterior distribution of the log PSD, whose observation errors under the Whittle likelihood follow a \(\log (\text {Exp}(1))\) distribution rather than a Gaussian. Having sampled the log PSD, we then accept/reject candidate GP hyperparameters with another Metropolis–Hastings step. Our acceptance ratio is,

$$\begin{aligned} \alpha = \min \left( 1, \frac{p(\log I(\mathbf {\nu }) \mid \theta ^{p}) p(\theta ^{c} \mid \theta ^{p})}{p(\log I(\mathbf {\nu }) \mid \theta ^{c}) p(\theta ^{p} \mid \theta ^{c})}\right) , \end{aligned}$$
(42)

where \(p(\log I(\mathbf {\nu }) \mid \theta )\) represents the Whittle likelihood, modelling the probability of the log periodogram, \(\log I(\mathbf {\nu })\), conditional on hyperparameters \(\theta \). \(p(\theta ^{c} \mid \theta ^{p})\) and \(p(\theta ^{p} \mid \theta ^{c})\) are the proposal densities for the GP hyperparameters \(\theta \). Note that in this particular case, the symmetric random walk proposal densities cancel out and our algorithm reduces simply to a Metropolis ratio. Again we follow the standard Metropolis–Hastings acceptance decision. If \(u < \alpha \),

$$\begin{aligned} \theta ^{c+1} \xleftarrow []{} \theta ^{p}, \end{aligned}$$
(43)

and the current hyperparameter values assume the proposed values. Alternatively, if \(u \ge \alpha \),

$$\begin{aligned} \theta ^{c+1} \xleftarrow []{} \theta ^{c}, \end{aligned}$$
(44)

the current values of the hyperparameters are not updated. Finally, following Roberts and Rosenthal (2004), we implement an adaptive step size within our random walk proposal. Every 50 iterations within our simulation, we compute the trailing acceptance ratio. An optimal acceptance ratio \(\text {Acc}^{Opt}\) of 0.234 is targeted. If the acceptance ratio is too low, indicating that the step size may be too large, then the step size is systematically reduced. That is, if \(\text {Acceptance Ratio}_{(j-49):j} < \text {Acc}^{Opt}\) for a given \(j \in \{50,100,150,...,10000\}\),

$$\begin{aligned} s^{2} \xleftarrow []{} s^{2} - \xi . \end{aligned}$$
(45)

If the acceptance ratio is too high, indicating that the step size may be too small, then the step size is systematically increased. That is, if \(\text {Acceptance Ratio}_{(j-49):j} > \text {Acc}^{Opt}\) for a given \(j \in \{50,100,...,10000\}\),

$$\begin{aligned} s^{2} \xleftarrow []{} s^{2} + \xi . \end{aligned}$$
(46)
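
The following is a minimal sketch of this adaptive rule, Eqs. (45) and (46): every 50 iterations the trailing acceptance rate is compared with the 0.234 target and the random walk variance \(s^{2}\) is nudged by \(\xi \); the default values and the floor on \(s^{2}\) are our own illustrative choices.

    def adapt_step_size(s2, accepted_last_50, xi=0.01, target=0.234):
        """Adjust the random-walk variance s^2 from the trailing acceptance rate."""
        acc_rate = sum(accepted_last_50) / len(accepted_last_50)
        if acc_rate < target:
            return max(s2 - xi, 1e-8)   # too few acceptances: shrink the step, Eq. (45)
        return s2 + xi                  # too many acceptances: enlarge the step, Eq. (46)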

Finally, the log PSD and the respective uncertainty bounds are determined by computing the median of the post-burn-in samples generated by the sampling procedure,

$$\begin{aligned}&{\mathcal {U}}^{\text {final}}_{0.025} = \text {median}({\mathcal {U}}^{5000:10 000}_{0.025}) \end{aligned}$$
(47)
$$\begin{aligned}&{\hat{g}} = \text {median}({\hat{g}}^{5000:10 000}) \end{aligned}$$
(48)
$$\begin{aligned}&{\mathcal {U}}^{\text {final}}_{0.975} = \text {median}({\mathcal {U}}^{5000:10 000}_{0.975}). \end{aligned}$$
(49)

Appendix E: Turning point algorithm

In this section, we provide more details for the identification of non-trivial peaks (local maxima). We aim to outline a broad and flexible framework for this purpose, in which the exact procedure may be altered according to the specific application. For example, one way to determine peaks of a given spectral estimate is simply by inspection. We aim to provide an algorithmic framework as an alternative to this.

Let \({\mathbf {g}}\) be an analytic or estimated log power spectral density function. We may begin, if necessary, by applying additional smoothing to this function; this step is optional. Following James et al. (2022), we apply a two-step algorithm to the (possibly smoothed) function \({\mathbf {g}}\), defined on \(\nu _j=\frac{j}{n}, j=0,1,...,m-1\). The first step produces an alternating sequence of local minima (troughs) and local maxima (peaks), which may include some immaterial turning points. The second step refines this sequence according to chosen conditions and parameters. The most important conditions to initially identify a peak or trough, respectively, are the following:

$$\begin{aligned} g(\nu _{j_0})&=\max \{g(\nu _j): \max (0,j_0 - l) \le j \le \min (j_0 + l,m-1)\}, \end{aligned}$$
(50)
$$\begin{aligned} g(\nu _{j_0})&=\min \{g(\nu _j): \max (0,j_0 - l) \le j \le \min (j_0 + l,m-1)\}, \end{aligned}$$
(51)

where l is a parameter to be chosen. Defining peaks and troughs according to this definition alone has some flaws, such as the potential for two consecutive peaks.

Instead, we implement an inductive procedure to choose an alternating sequence of peaks and troughs. Suppose \(j_0\) is the last determined peak. We search over indices \(j>j_0\) for the first of two cases: if we find a time \(j_1>j_0\) that satisfies (51) as well as a non-triviality condition \(g(j_1)<g(j_0)\), we add \(j_1\) to the set of troughs and proceed from there. If we find a time \(j_1>j_0\) that satisfies (50) and \(g(j_0)\ge g(j_1)\), we ignore this lower peak as redundant; if we find a time \(j_1>j_0\) that satisfies (50) and \(g(j_1) > g(j_0)\), we remove the peak \(j_0\), replace it with \(j_1\) and continue from \(j_1\). A similar process applies from a trough at \(j_0\).
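
The following is a minimal sketch of the two ingredients described above, the windowed conditions (50) and (51) and the inductive alternation rule; g is assumed to be a NumPy array holding the (possibly smoothed) log PSD, the window length l is a tuning parameter, and the exact refinement used in the paper may differ in its details.

    import numpy as np

    def is_peak(g, j0, l):
        lo, hi = max(0, j0 - l), min(j0 + l, len(g) - 1)
        return g[j0] == g[lo:hi + 1].max()            # condition (50)

    def is_trough(g, j0, l):
        lo, hi = max(0, j0 - l), min(j0 + l, len(g) - 1)
        return g[j0] == g[lo:hi + 1].min()            # condition (51)

    def alternating_turning_points(g, l):
        """Alternating peaks and troughs via the inductive replacement rule."""
        peaks, troughs, last = [], [], None           # last is 'peak', 'trough', or None
        for j in range(len(g)):
            if is_peak(g, j, l):
                if last == 'peak':
                    if g[j] > g[peaks[-1]]:           # a higher peak supersedes the last one
                        peaks[-1] = j
                elif last is None or g[j] > g[troughs[-1]]:   # non-triviality vs last trough
                    peaks.append(j)
                    last = 'peak'
            elif is_trough(g, j, l):
                if last == 'trough':
                    if g[j] < g[troughs[-1]]:         # a lower trough supersedes the last one
                        troughs[-1] = j
                elif last is None or g[j] < g[peaks[-1]]:     # non-triviality vs last peak
                    troughs.append(j)
                    last = 'trough'
        return peaks, troughs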

As a side remark, for an analytic log PSD \({\mathbf {g}}\), we could simply use the analytical and differentiable form to find critical points as an alternative.

Either way, at this point the function has been assigned an alternating sequence of troughs and peaks. However, some turning points are immaterial and should be removed. Here, the framework can incorporate a flexible series of options to refine the set of peaks.

As mentioned in Sect. 3.2, one simple option is to remove any local maximum (peak) \({\hat{\rho }}\) of \(\hat{{\mathbf {g}}}\) with \({\hat{g}}({\hat{\rho }})<\max \hat{{\mathbf {g}}} - \delta \), for some sensible constant \(\delta \). In our experiments, the same results are produced for any \(\delta \in [2,4]\), demonstrating the robustness of this relatively simple idea. Under an affine transformation of the original time series \(X'_t = aX_t + b\), the log PSD \({\hat{g}}\) changes by an additive constant, so this condition is unchanged when rescaling the original data.
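
A one-line filter implementing this refinement follows; g_hat is the estimated log PSD (as a NumPy array), peaks the indices from the alternating sequence, and the default \(\delta = 3\) sits within the robust range [2, 4] noted above.

    def refine_peaks(g_hat, peaks, delta=3.0):
        """Keep only peaks within delta of the global maximum of the log PSD."""
        return [p for p in peaks if g_hat[p] >= g_hat.max() - delta]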

This relatively simple condition is sufficient for our application. For the benefit of future work, we list some alternative options for an algorithmic refinement of the peaks (besides, of course, inspection as the simplest option). In previous work, we have analysed functions \(\nu (t)\) that take values only in the non-negative reals. Thus, one may simply apply a linear shift to the log PSD \({\mathbf {g}}\) so that its minimum value is zero. Then, numerous options exist for refinement of non-trivial peaks (and troughs).

For example, let \(t_1<t_3\) be two peaks, necessarily separated by a trough. We select a parameter \(\delta =0.2\); if the peak ratio \(\frac{\nu (t_3)}{\nu (t_1)}\) is less than \(\delta \), we remove the peak \(t_3\). If two consecutive troughs \(t_2,t_4\) remain, we remove \(t_2\) if \(\nu (t_2)>\nu (t_4)\), and otherwise remove \(t_4\). That is, if the second peak is less than a fraction \(\delta \) of the size of the first peak, we remove it. Alternatively, one may apply this peak ratio to any peak, comparing it to the global maximum rather than just the adjacent peak. That is, let \(t_0\) be the global maximum. Then, one could remove any peak \(t_1\) with \(\frac{\nu (t_1)}{\nu (t_0)} < \delta \).

Alternatively, we use appropriately defined gradient or log-gradient comparisons between points \(t_1<t_2\). For example, let

$$\begin{aligned} {{\,\mathrm{log-grad}\,}}(t_1,t_2)=\frac{\log \nu (t_2) - \log \nu (t_1)}{t_2-t_1}. \end{aligned}$$
(52)

The numerator equals \(\log (\frac{\nu (t_2)}{\nu (t_1)})\), a “logarithmic rate of change”. Unlike the standard rate of change given by \(\frac{\nu (t_2)}{\nu (t_1)} -1\), the logarithmic change is symmetric and ranges over \((-\infty ,\infty )\). Let \(t_1,t_2\) be adjacent turning points (one a trough, one a peak). We choose a parameter \(\epsilon \); if

$$\begin{aligned} |{{\,\mathrm{log-grad}\,}}(t_1,t_2)|<\epsilon , \end{aligned}$$
(53)

that is, the average logarithmic change is less than \(\epsilon \), we remove \(t_2\) from our sets of peaks and troughs. If \(t_2\) is not the final turning point, we also remove \(t_1\). After these refinement steps, we are left with an alternating sequence of non-trivial peaks and troughs. Finally, for this framework, we only need the peaks, so we simply discard the troughs.

As a final remark, only at the end is the final number r of non-trivial peaks determined. It is a function not only of the log PSD function \({\mathbf {g}}\), but also the precise conditions used to select and refine the (non-trivial) peaks.

Cite this article

James, N., Menzies, M. Optimally adaptive Bayesian spectral density estimation for stationary and nonstationary processes. Stat Comput 32, 45 (2022). https://doi.org/10.1007/s11222-022-10103-4
