## Abstract

This article improves on existing Bayesian methods for estimating the spectral density of stationary and nonstationary time series under a Gaussian process prior. By optimising an appropriate eigendecomposition of a smoothing spline covariance structure, our method more appropriately models data with both simple and complex periodic structure. We further justify the utility of this optimal eigendecomposition by investigating the performance of covariance functions other than smoothing splines. We show that the optimal eigendecomposition provides a material improvement, while the other covariance functions under examination do not, all performing comparably well to the smoothing spline. During our computational investigation, we introduce new validation metrics for the spectral density estimate, inspired by the physical sciences. We validate our models in an extensive simulation study and demonstrate superior performance with real data.

## References

Adak, S.: Time-dependent spectral analysis of nonstationary time series. J. Am. Stat. Assoc. **93**(444), 1488–1501 (1998). https://doi.org/10.1080/01621459.1998.10473808

Barbe, K., Pintelon, R., Schoukens, J.: Welch method revisited: nonparametric power spectrum estimation via circular overlap. IEEE Trans. Signal Process. **58**(2), 553–565 (2010). https://doi.org/10.1109/tsp.2009.2031724

Box, G.E.P., Jenkins, G.M., Reinsel, G.C., Ljung, G.M.: Time Series Analysis: Forecasting and Control. Wiley, Hoboken (2015)

Brockwell, P.J., Davis, R.A.: Time Series: Theory and Methods. Springer, New York (1991). https://doi.org/10.1007/978-1-4419-0320-4

Carter, C.K., Kohn, R.: Semiparametric Bayesian inference for time series with mixed spectra. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) **59**(1), 255–268 (1997). https://doi.org/10.1111/1467-9868.00067

Choudhuri, N., Ghosal, S., Roy, A.: Bayesian estimation of the spectral density of a time series. J. Am. Stat. Assoc. **99**(468), 1050–1059 (2004). https://doi.org/10.1198/016214504000000557

Cogburn, R., Davis, H.T.: Periodic splines and spectral estimation. Ann. Stat. **2**(6), 1108–1126 (1974). https://doi.org/10.1214/aos/1176342868

Dahlhaus, R.: Fitting time series models to nonstationary processes. Ann. Stat. **25**(1), 1–37 (1997). https://doi.org/10.1214/aos/1034276620

Duvenaud, D., Lloyd, J., Grosse, R., Tenenbaum, J., Ghahramani, Z.: Structure discovery in nonparametric regression through compositional kernel search. In: Proceedings of the 30th International Conference on Machine Learning, vol. 28, pp. 1166–1174 (2013)

Edwards, M.C., Meyer, R., Christensen, N.: Bayesian nonparametric spectral density estimation using B-spline priors. Stat. Comput. **29**(1), 67–78 (2019). https://doi.org/10.1007/s11222-017-9796-9

Eilers, P.H.C., Marx, B.D.: Flexible smoothing with B-splines and penalties. Stat. Sci. **11**(2), 89–121 (1996). https://doi.org/10.1214/ss/1038425655

Gangopadhyay, A., Mallick, B., Denison, D.: Estimation of spectral density of a stationary time series via an asymptotic representation of the periodogram. J. Stat. Plan. Inference **75**(2), 281–290 (1999). https://doi.org/10.1016/s0378-3758(98)00148-7

Green, P.J.: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika **82**(4), 711–732 (1995). https://doi.org/10.1093/biomet/82.4.711

Gu, C.: Smoothing Spline ANOVA Models. Springer, New York (2013). https://doi.org/10.1007/978-1-4614-5369-7

Guo, W., Dai, M., Ombao, H.C., von Sachs, R.: Smoothing spline ANOVA for time-dependent spectral analysis. J. Am. Stat. Assoc. **98**(463), 643–652 (2003). https://doi.org/10.1198/016214503000000549

Hadj-Amar, B., Rand, B.F., Fiecas, M., Lévi, F., Huckstepp, R.: Bayesian model search for nonstationary periodic time series. J. Am. Stat. Assoc. **115**(531), 1320–1335 (2019). https://doi.org/10.1080/01621459.2019.1623043

Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika **57**(1), 97–109 (1970). https://doi.org/10.1093/biomet/57.1.97

James, N., Menzies, M.: A new measure between sets of probability distributions with applications to erratic financial behavior. J. Stat. Mech: Theory Exp. **2021**(12), 123404 (2021). https://doi.org/10.1088/1742-5468/ac3d91

James, N., Menzies, M.: Collective correlations, dynamics, and behavioural inconsistencies of the cryptocurrency market over time. Nonlinear Dyn. (2022). https://doi.org/10.1007/s11071-021-07166-9

James, N., Menzies, M., Azizi, L., Chan, J.: Novel semi-metrics for multivariate change point analysis and anomaly detection. Physica D **412**, 132636 (2020). https://doi.org/10.1016/j.physd.2020.132636

James, N., Menzies, M., Bondell, H.: Comparing the dynamics of COVID-19 infection and mortality in the United States, India, and Brazil. Physica D **432**, 133158 (2022). https://doi.org/10.1016/j.physd.2022.133158

Lu, J., Hoi, S.C., Wang, J., Zhao, P., Liu, Z.Y.: Large scale online kernel learning. J. Mach. Learn. Res. **17**(47), 1–43 (2016)

Mann, M.E., Lees, J.M.: Robust estimation of background noise and signal detection in climatic time series. Clim. Change **33**(3), 409–445 (1996). https://doi.org/10.1007/bf00142586

Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. **21**(6), 1087–1092 (1953). https://doi.org/10.1063/1.1699114

Paciorek, C.J., Schervish, M.J.: Nonstationary covariance functions for Gaussian process regression. In: Proceedings of the 16th International Conference on Neural Information Processing Systems, pp. 273–280. MIT Press (2003)

Percival, D.B., Walden, A.T.: Spectral Analysis for Physical Applications. Cambridge University Press, Cambridge (1993). https://doi.org/10.1017/cbo9780511622762

Plagemann, C., Kersting, K., Burgard, W.: Nonstationary Gaussian process regression using point estimates of local smoothness. In: Machine Learning and Knowledge Discovery in Databases, pp. 204–219. Springer, Berlin, Heidelberg (2008). https://doi.org/10.1007/978-3-540-87481-2_14

Prakash, A., James, N., Menzies, M., Francis, G.: Structural clustering of volatility regimes for dynamic trading strategies. Appl. Math. Finance **28**(3), 236–274 (2021). https://doi.org/10.1080/1350486x.2021.2007146

Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. MIT Press, Cambridge (2005)

Roberts, G.O., Rosenthal, J.S.: General state space Markov chains and MCMC algorithms. Probab. Surv. **1**, 20–71 (2004). https://doi.org/10.1214/154957804100000024

Rosen, O., Stoffer, D.S., Wood, S.: Local spectral analysis via a Bayesian mixture of smoothing splines. J. Am. Stat. Assoc. **104**(485), 249–262 (2009). https://doi.org/10.1198/jasa.2009.0118

Rosen, O., Wood, S., Stoffer, D.: BayesSpec: Bayesian Spectral Analysis Techniques (2017). https://CRAN.R-project.org/package=BayesSpec. R package version 0.5.3

Rosen, O., Wood, S., Stoffer, D.S.: AdaptSPEC: adaptive spectral estimation for nonstationary time series. J. Am. Stat. Assoc. **107**(500), 1575–1589 (2012). https://doi.org/10.1080/01621459.2012.716340

Thomson, D.: Spectrum estimation and harmonic analysis. Proc. IEEE **70**, 1055–1096 (1982)

Todd, J.F.: Recommendations for nomenclature and symbolism for mass spectroscopy. Int. J. Mass Spectrom. Ion Processes **142**(3), 209–240 (1995). https://doi.org/10.1016/0168-1176(95)93811-f

Todd, J.F.J.: Recommendations for nomenclature and symbolism for mass spectroscopy (including an appendix of terms used in vacuum technology) (Recommendations 1991). Pure Appl. Chem. **63**(10), 1541–1566 (1991). https://doi.org/10.1351/pac199163101541

Wahba, G.: Automatic smoothing of the log periodogram. J. Am. Stat. Assoc. **75**(369), 122–132 (1980). https://doi.org/10.1080/01621459.1980.10477441

Wahba, G.: Spline Models for Observational Data. Society for Industrial and Applied Mathematics (1990). https://doi.org/10.1137/1.9781611970128

Wasserman, L.: All of Statistics: A Concise Course in Statistical Inference. Springer, New York (2004)

Whittle, P.: On stationary processes in the plane. Biometrika **41**(3–4), 434–449 (1954). https://doi.org/10.1093/biomet/41.3-4.434

Whittle, P.: Curve and periodogram smoothing. J. R. Stat. Soc.: Ser. B (Methodol.) **19**(1), 38–47 (1957). https://doi.org/10.1111/j.2517-6161.1957.tb00242.x

Wilson, A.G., Adams, R.P.: Gaussian process kernels for pattern discovery and extrapolation. In: Proceedings of the 30th International Conference on Machine Learning, vol. 28, pp. 1067–1075 (2013)

Wood, S., Rosen, O., Kohn, R.: Bayesian mixtures of autoregressive models. J. Comput. Graph. Stat. **20**(1), 174–195 (2011). https://doi.org/10.1198/jcgs.2010.09174

Wood, S.A., Jian, W., Tanner, M.: Bayesian mixture of splines for spatially adaptive nonparametric regression. Biometrika **89**(3), 513–528 (2002). https://doi.org/10.1093/biomet/89.3.513

Wood, S.N.: P-splines with derivative based penalties and tensor product smoothing of unevenly distributed data. Stat. Comput. **27**(4), 985–989 (2017). https://doi.org/10.1007/s11222-016-9666-x

## Acknowledgements

Many thanks to Lamiae Azizi, Sally Cripps and Alex Judge for helpful discussions.

## Author information

### Authors and Affiliations

### Corresponding author

## Ethics declarations

### Conflict of interest

The authors declare that they have no conflict of interest.

## Additional information

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Appendices

### Appendix A: Percival–Walden AR process

In this brief section, we apply our methodology to a rather challenging autoregressive process that has been highlighted several times in the literature (Box et al. 2015; Percival and Walden 1993) and is commonly known as the Percival–Walden AR(4). This process is defined as \(x_{t} = 2.7607 x_{t-1} - 3.8106 x_{t-2} + 2.6535 x_{t-3} -0.9238 x_{t-4} + \epsilon _{t}\), where \(\epsilon _{t}\) is a white noise process, simulated with length \(n=1024\). As in Sect. 4, we simulate the process and validate our spectral density estimates against the known analytic power spectrum. In this experiment, the optimal and penalised optimal smoothing splines coincide, with 23 eigenvectors. The spectral estimates are plotted in Fig. 5, while validation metrics are provided in Table 4.
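For reference, the process and its analytic log spectrum can be reproduced in a few lines. The sketch below is an illustration only, assuming unit-variance Gaussian innovations and a discarded burn-in period; it follows the AR(4) definition above, with the analytic log spectrum \(\log S(\nu ) = \log \sigma ^{2} - 2 \log |1 - \sum _{k} \phi _{k} e^{-2\pi i \nu k}|\).

```python
import numpy as np

# Percival-Walden AR(4) coefficients, as defined above.
PHI = np.array([2.7607, -3.8106, 2.6535, -0.9238])

def simulate_ar4(n=1024, burn=500, seed=0):
    """Simulate x_t = phi_1 x_{t-1} + ... + phi_4 x_{t-4} + eps_t."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n + burn)
    x = np.zeros(n + burn)
    for t in range(4, n + burn):
        x[t] = PHI @ x[t - 4:t][::-1] + eps[t]
    return x[burn:]  # discard burn-in so the retained series is approximately stationary

def analytic_log_psd(freqs, sigma2=1.0):
    """Analytic log spectrum: log sigma^2 - 2 log|1 - sum_k phi_k e^{-2 pi i nu k}|."""
    k = np.arange(1, 5)
    denom = 1.0 - np.exp(-2j * np.pi * np.outer(freqs, k)) @ PHI
    return np.log(sigma2) - 2.0 * np.log(np.abs(denom))

x = simulate_ar4()
freqs = np.linspace(0.0, 0.5, 513)
g = analytic_log_psd(freqs)
```

The two peaks of this analytic spectrum sit close together in frequency, which is what makes peak recovery difficult for smooth estimators.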

This experiment also provides an example where the proximity matching criterion of Sect. 3.2 fails. We can simply observe that the analytic log power spectrum \({\mathbf {g}}\) has two peaks, while the spectral estimate \(\hat{{\mathbf {g}}}\) for both \(\hbox {Spline}_{{10}}\) and the (penalised) optimal smoothing spline have one peak each. As such, Table 4 includes the values of the semi-metric presented in (24) in Sect. 3.3. We observe that the (penalised) optimal smoothing spline provides a better approximation of the amplitudes of the two peaks than the existing method of Rosen et al.

### Appendix B: Discussion of select existing methods

Alongside the statistics community, many signal processing practitioners and engineers have long been interested in the study of time series’ power spectra. Thus, it is worth noting the differences between a framework such as ours and frequentist or signal processing-based methods for power spectral density estimation.

We begin by describing Welch’s method, which is based upon Bartlett’s method. Welch’s method aims to reduce noise in the resulting power spectral density estimate at the cost of frequency resolution. The data are subdivided into overlapping segments, in each of which a modified periodogram is computed. These modified periodograms are averaged to produce a final estimate of the power spectral density. There are two model parameters in Welch’s method: the length of each segment and the degree of overlap in data points between adjacent segments.
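For illustration, Welch's method is available in SciPy, where the two model parameters appear as `nperseg` (segment length) and `noverlap` (overlap); the signal and parameter values below are arbitrary choices for the sketch, not recommendations.

```python
import numpy as np
from scipy.signal import welch

rng = np.random.default_rng(1)
t = np.arange(4096)
# Illustrative signal: a sinusoid at 0.1 cycles/sample plus white noise.
x = np.sin(2 * np.pi * 0.1 * t) + rng.standard_normal(t.size)

# Segment length and overlap are the two tuning parameters of Welch's method.
freqs, pxx = welch(x, fs=1.0, nperseg=256, noverlap=128)

f_peak = freqs[np.argmax(pxx)]  # lies near the sinusoid's frequency
```

Shorter segments average more periodograms (lower variance) but coarsen the frequency grid, which is the resolution trade-off described above.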

Practically, Welch’s method has several limitations in comparison to Bayesian methods. First, the resulting power spectral density estimate may be less smooth (though not uniformly so). For scientists hoping to draw conclusions about the maximum amplitude and corresponding frequencies of an underlying time series, the often rough nature of Welch’s estimate may make inference more difficult. It is common in the Bayesian statistics literature to place a flexible prior, such as a Gaussian process, on the log of the power spectrum. The smoothness of the Gaussian process may depend strongly on the covariance structure chosen by the modeller; covariance functions such as the squared exponential and select Matérn family variants allow for smooth interpolation in the resulting power spectral density estimate. Furthermore, recent research has shown that the variance is not a monotonically decreasing function of the fraction of overlap between adjacent segments (Barbe et al. 2010). Second, Welch’s method is unable to algorithmically partition the time series based on changes in the power spectral density. Procedures such as the RJMCMC introduced in this paper identify points in time where the power spectral density has changed; with Welch’s method, one cannot determine locations in the time domain corresponding to changes in the underlying periodic nature of a process.

That said, many practitioners in the signal processing literature use techniques such as wavelets for spectral density estimation in a nonstationary setting. For instance, the continuous wavelet transform has been applied for spectral analysis of nonstationary signals. Wavelets overcome an obvious limitation of Fourier-transform-driven methods, for which abrupt changes in a time series’ behaviour are difficult to capture (due to their underlying construction as a sum of sinusoidal waves). Unlike sine waves, which oscillate smoothly, wavelets are derived from “step functions” that exist for a finite duration, allowing abrupt changes to be captured efficiently in modelling tasks.

Third, many would argue that a Bayesian framework such as ours provides a more principled approach to uncertainty quantification than frameworks such as Welch’s method. The methodology proposed in this paper quantifies uncertainty surrounding the power spectral density estimate, in addition to uncertainty surrounding the change point locations. One clear advantage of Welch’s method over the method we have proposed (and other MCMC-based methods), however, is its significantly lower computational cost. While there are certainly frequentist methods to estimate the uncertainty in traditional signal processing estimates, many practitioners prefer the posterior distributions provided by Bayesian methods, not least for the ability to make probabilistic statements about unknown parameters (Wasserman 2004).

Another commonly used framework for spectral density estimation is the multitaper method. Multitaper analysis is an extension of traditional taper analysis, in which a time series is tapered before applying a Fourier transformation to reduce potential bias resulting from spectral leakage. The multitaper method averages over a variety of estimators with varying window functions (Thomson 1982; Mann and Lees 1996). This results in a power spectrum that exhibits reduced leakage and variance while retaining important information from the initial and final sequences of the underlying time series. One major advantage of the multitaper method is that it can be applied in a fairly automatic manner, and it is therefore appropriate in situations where many individual time series must be processed and a thorough analysis of each is not feasible. One possible limitation of the multitaper method is reduced spectral resolution. The multitaper method has proved to be an effective estimator in the presence of complex spectra. For example, Percival and Walden (1993) highlight the estimator’s effectiveness in detecting two peaks in the case of their AR(4) process described in Appendix A; as we saw, our methodology was unable to detect the two peaks.

Of course, there are many techniques currently in use beyond Welch’s method and the multitaper method described above. The choice between frequentist and Bayesian methods may depend on the precise problem and even the philosophical outlook of the practitioner. The literature is enriched by robust, continual development of both approaches.
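As a sketch of the core multitaper idea, assuming SciPy's DPSS (Slepian) windows and a simple unweighted average over tapers rather than Thomson's adaptive weighting; the signal, time-bandwidth product `nw`, and taper count `k` are illustrative choices.

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_psd(x, nw=4.0, k=7):
    """Simplified multitaper estimate: average the periodograms of x
    computed under K orthonormal DPSS tapers (no adaptive weighting)."""
    n = x.size
    tapers = dpss(n, nw, k)                              # shape (k, n)
    spectra = np.abs(np.fft.rfft(tapers * x, axis=1)) ** 2
    return np.fft.rfftfreq(n), spectra.mean(axis=0)      # unnormalised average

rng = np.random.default_rng(2)
t = np.arange(2048)
x = np.sin(2 * np.pi * 0.2 * t) + 0.5 * rng.standard_normal(t.size)
freqs, pxx = multitaper_psd(x)
```

Averaging over orthogonal tapers reduces variance relative to a single taper, while the concentration bandwidth set by `nw` governs the loss of spectral resolution mentioned above.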

### Appendix C: Reversible jump sampling scheme

We follow Rosen et al. (2017, 2012) in our core implementation of the reversible jump sampling scheme. We remark that our method does not improve the trans-dimensional component of the model, described by the reversible jump scheme below. A time series partition with *m* segments is denoted \({\varvec{\xi }}_{m} = (\xi _{0,m},...,\xi _{m,m})\). We wish to estimate a vector of *amplitude parameters* \({\varvec{\tau }}_{m}^{2} = (\tau _{1,m}^{2},...,\tau _{m,m}^{2})'\) and *regression coefficients* \({\varvec{\beta }}_{m} = ({\varvec{\beta '}}_{1,m},...,{\varvec{\beta '}}_{m,m})\) for the *j*th component within a partition of *m* segments, \(j=1,...,m.\) For notational simplicity, \({\varvec{\beta }}_{j,m}, j=1,...,m,\) is assumed to include the first entry, \(\alpha _{0j,m}.\) In what follows, superscripts *c* and *p* refer to the current and proposed values in the sampling scheme.

First, we describe the **between-model moves:** let \({\varvec{\theta }}_{m} = ({\varvec{\xi }}'_{m}, {\varvec{\tau }}^{2'}_{m}, {\varvec{\beta '}}_{m})\) be the model parameters at some point in the sampling scheme and assume that the chain starts at \((m^c, {\varvec{\theta }}_{m^c}^{c})\). The algorithm proposes the move to \((m^p, {\varvec{\theta }}_{m^p}^p)\) by drawing \((m^p, {\varvec{\theta }}_{m^p}^p)\) from the proposal distribution \(q(m^p, {\varvec{\theta }}_{m^p}^p|m^c, {\varvec{\theta }}_{m^c}^c)\). That draw is accepted with probability

\(\alpha = \min \left\{ 1, \dfrac{p(m^p, {\varvec{\theta }}_{m^p}^{p}) \, q(m^c, {\varvec{\theta }}_{m^c}^{c} \mid m^p, {\varvec{\theta }}_{m^p}^{p})}{p(m^c, {\varvec{\theta }}_{m^c}^{c}) \, q(m^p, {\varvec{\theta }}_{m^p}^{p} \mid m^c, {\varvec{\theta }}_{m^c}^{c})} \right\},\)

with \(p(\cdot )\) referring to a target distribution, the product of the likelihood and the prior. The target and proposal distributions will vary based on the type of move taken in the sampling scheme. First, \(q(m^p, {\varvec{\theta }}_{m^{p}}^{p}| m^c, {\varvec{\theta }}_{m^{c}}^{c})\) is described as follows:

To draw \((m^p, {\varvec{\theta }}_{m^p}^p)\), one must first draw \(m^p\), followed by \({\varvec{\xi }}_{m^p}^p\), \({\varvec{\tau }}_{m^p}^{2p}, \text { and } {\varvec{\beta }}_{m^p}^{p}\). First, the number of segments \(m^p\) is drawn from the proposal distribution \(q(m^p|m^c)\). Let *M* be the maximum number of segments and \(m^{c}_{2,\text {min}}\) be the number of current segments containing at least \(2 t_{\text {min}}\) data points. The proposal is as follows:

Conditional on the proposed model \(m^p\), a new partition \({\varvec{\xi }}_{m^p}^{p}\), a new vector of covariance amplitude parameters \({\varvec{\tau }}_{m^p}^{2p}\) and a new vector of regression coefficients, \({\varvec{\beta }}_{m^p}^{p}\) are proposed. In Rosen et al. (2012), \(\tau ^{2}\) is referred to as a smoothing parameter. To impact the smoothness of the covariance function, the parameter would have to impact pairwise operations. Given that \(\tau ^{2}\) sits outside the covariance matrix, we will refer to \(\tau ^{2}\) as an amplitude parameter (akin to signal variance within the Gaussian process framework Rasmussen and Williams 2005).

Now, we describe the process of the **birth** of new segments. Suppose that \(m^p = m^c + 1\). A time series partition,

is drawn from the proposal distribution \(q({\varvec{\xi }}_{m^p}^{p}|m^p, m^c, {\varvec{\theta }}_{m^c}^c)\). The algorithm proposes a partition by first selecting a random segment \(j = k^{*}\) to split. Then, a point \(t^{*}\) within the segment \(j=k^{*}\) is randomly selected to be the proposed partition point. This is subject to the constraint,

\(\xi _{k^{*}-1, m^c}^{c} + t_{\text {min}} \le t^{*} \le \xi _{k^{*},m^c}^c - t_{\text {min}}\). The proposal distribution is computed as follows:

The vector of amplitude parameters

is drawn from the proposal distribution

\(q({\varvec{\tau }}_{m^p}^{2p}|m^p, {\varvec{\xi }}_{m^p}^p, m^c, {\varvec{\theta }}_{m^c}^c) = q({\varvec{\tau }}_{m^p}^{2p}|m^p, {\varvec{\tau }}_{m^c}^{2c}).\) The algorithm is based on the reversible jump algorithm of Green (1995). It draws from a uniform distribution \(u \sim U[0,1]\) and defines \(\tau _{k^{*}, m^p}^{2p}\) and \(\tau _{k^{*}+1, m^p}^{2p}\) in terms of *u* and \(\tau _{k^{*}, m^c}^{2c}\) as follows:

The vector of coefficients

is drawn from the proposal distribution

\(q({\varvec{\beta }}_{m^p}^p|{\varvec{\tau }}_{m^p}^{2p},{\varvec{\xi }}_{m^p}^{2p},m^p, m^c, {\varvec{\theta }}_{m^c}^c) = q({\varvec{\beta }}_{m^p}^{p}|{\varvec{\tau }}_{m^p}^{2p}, {\varvec{\xi }}_{m^p}^p, m^p)\). The pair of vectors \({\varvec{\beta }}_{k^{*}, m^p}^p\) and \({\varvec{\beta }}_{k^{*}+1, m^p}^p\) are drawn from Gaussian approximations to the respective posterior conditional distributions \(p({\varvec{\beta }}_{k^{*}, m^p}^p|{\varvec{x}}_{k^{*}}^p, \tau _{k^{*}, m^p}^{2p}, m^p)\) and

\(p({\varvec{\beta }}_{k^{*}+1, m^p}^{p}|{\varvec{x}}_{k^{*}+1}^p, \tau _{k^{*}+1, m^p}^{2p}, m^p)\), respectively. Here, \({\varvec{x}}_{k^{*}}^p\) and \({\varvec{x}}_{k^{*}+1}^p\) refer to the subsets of the time series with respective segments \(k^{*}\) and \(k^{*}+1\). \({\varvec{\xi }}_{m^p}^p\) will determine \({\varvec{x_{*}}}^p = ({\varvec{x}}_{k^{*}}^{p'}, {\varvec{x}}_{k^{*}+1}^{p'})'\). For the sake of exposition, we provide the following example: the coefficient \({\varvec{\beta }}_{k^{*}, m^p}^{p}\) is drawn from the Gaussian distribution \(N({\varvec{\beta }}_{k^{*}}^{\text {max}}, \Sigma _{k^{*}}^{\text {max}})\), where \({\varvec{\beta }}_{k^{*}}^{\text {max}}\) is defined as

and

For the birth move, the probability of acceptance is \(\alpha = \min \{1,A\}\), where *A* is equal to

Above, \(p(u) = 1, 0 \le u \le 1,\) while \(p({\varvec{\beta }}_{k^{*}, m^p}^{p})\) and \(p({\varvec{\beta }}_{k^{*}+1, m^p}^{p})\) are Gaussian proposal distributions \(N({\varvec{\beta }}_{k^{*}}^{\text {max}}, \Sigma _{k^{*}}^{\text {max}})\) and

\(N({\varvec{\beta }}_{k^{*}+1}^{\text {max}}, \Sigma _{k^{*}+1}^{\text {max}})\), respectively. The Jacobian is computed as

Next, we describe the process of the **death** of a segment, that is, the reverse of a birth move, where \(m^p = m^c - 1\). A time series partition

is proposed by randomly selecting a single partition point from \(m^c - 1\) candidates and removing it. The partition point selected for removal is denoted \(j=k^{*}\). There are \(m^c - 1\) possible partition points available for removal among the \(m^c\) segments currently in existence. The proposal may choose each partition point with equal probability, that is,

The vector of amplitude parameters

is drawn from the proposal distribution

\(q({\varvec{\tau }}_{m^p}^{2p}|m^p, {\varvec{\xi }}_{m^p}^p, m^c, {\varvec{\theta }}_{m^c}^{c}) = q({\varvec{\tau }}_{m^p}^{2p}| m^p, {\varvec{\tau }}_{m^c}^{2c})\). One amplitude parameter \(\tau _{k^{*}, m^p}^{2p}\) is formed from two candidate amplitude parameters, \(\tau _{k^{*},m^c}^{2c}\) and \(\tau _{k^{*}+1,m^c}^{2c}\). This is done by reversing Eqs. (27) and (28). That is,

Finally, the vector of regression coefficients,

is drawn from the proposal distribution

\(q({\varvec{\beta }}_{m^p}^p|{\varvec{\tau }}_{m^p}^{2p}, {\varvec{\xi }}_{m^p}^p, m^p, m^c, \theta _{m^c}^c) = q({\varvec{\beta }}_{m^p}^{p}|{\varvec{\tau }}_{m^p}^{2p}, {\varvec{\xi }}_{m^p}^p, m^p)\). The vector of regression coefficients is drawn from a Gaussian approximation to the posterior distribution

\(p(\beta _{k^{*},m^p}|{\varvec{x}}, \tau _{k^{*}, m^p}^{2p}, {\varvec{\xi }}^p_{m^p}, m^p)\), following the same procedure as for the vector of coefficients in the birth step. The probability of acceptance is the inverse of that of the analogous birth step. If the move is accepted, the following updates occur: \(m^c=m^p\) and \({\varvec{\theta }}_{m^c}^c = {\varvec{\theta }}_{m^p}^{p}\).

Finally, we describe the **within-model moves:** henceforth, *m* is fixed; accordingly, notation describing the dependence on the number of segments is removed. There are two parts to a within-model move. First, a segment relocation is performed, and conditional on the relocation, the basis function coefficients are updated. The steps are jointly accepted or rejected with a Metropolis–Hastings step. The amplitude parameters are updated within a separate Gibbs sampling step.

The chain is assumed to be located at \({\varvec{\theta }}^{c} = ({\varvec{\xi }}^{c}, {\varvec{\beta }}^{c})\). The proposed move \({\varvec{\theta }}^p = ({\varvec{\xi }}^p, {\varvec{\beta }}^p)\) is as follows: first, a partition point \(\xi _{k^{*}}\) is selected for relocation from the \(m-1\) candidate partition points. Next, a position within the interval \([\xi _{k^{*}-1}, \xi _{k^{*}+1}]\) is selected, subject to the constraint that the new location is at least \(t_{\text {min}}\) data points away from \(\xi _{k^{*}-1}\) and \(\xi _{k^{*}+1}\), so that

where \(\Pr (j=k^{*}) = (m-1)^{-1}\). A mixture distribution for \(\Pr (\xi _{k^{*}}^p=t|j=k^{*})\) is constructed to explore the space most efficiently, so

where \(q_1(\xi _{k^{*}}^p = t| \xi _{k^{*}}^c) = (n_{k^{*}} + n_{k^{*}+1}-2t_{\text {min}} + 1)^{-1}, \xi _{k^{*}-1} + t_{\text {min}} \le t \le \xi _{k^{*}+1} - t_{\text {min}}\) and

The support of \(q_1\) has \(n_{k^{*}} + n_{k^{*}+1} - 2t_{\text {min}} + 1\) data points while \(q_2\) has at most three. The term \(q_2\) alone would result in a high acceptance rate for the Metropolis–Hastings, but it would explore the parameter space slowly. The \(q_1\) component allows for larger jumps, and produces a compromise between a high acceptance rate and thorough exploration of the parameter space.
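The mixture of a wide uniform component \(q_1\) and a narrow local component \(q_2\) can be sketched as follows. This is a hypothetical illustration: the mixture weight `p_local` and the helper name `propose_relocation` are assumptions, not the paper's exact specification.

```python
import numpy as np

def propose_relocation(xi_prev, xi_cur, xi_next, t_min, p_local=0.5, rng=None):
    """Draw a new partition point from a mixture of q1 (uniform over the whole
    admissible range, enabling large jumps) and q2 (uniform over at most three
    points adjacent to the current location, giving high acceptance)."""
    if rng is None:
        rng = np.random.default_rng()
    lo, hi = xi_prev + t_min, xi_next - t_min      # admissible range
    if rng.uniform() < p_local:
        # q2: small local step around the current point, clipped to the range
        candidates = [t for t in (xi_cur - 1, xi_cur, xi_cur + 1) if lo <= t <= hi]
        return int(rng.choice(candidates))
    # q1: large jump, uniform over the full admissible range
    return int(rng.integers(lo, hi + 1))

t_new = propose_relocation(100, 150, 200, t_min=10, rng=np.random.default_rng(0))
```

Mixing the two components trades off acceptance rate against how quickly the chain traverses the space, exactly the compromise described above.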

Next, \({\varvec{\beta ^{p}_{j}}}, j=k^{*}, k^{*}+1\), are drawn from an approximation to \(\prod ^{k^{*}+1}_{j=k^{*}} p({\varvec{\beta }}_j|{\varvec{x}}_j^p, \tau _j^{2})\), following the analogous step in the between-model move. The proposal distribution, which is evaluated at \({\varvec{\beta }}^{p}_j, j=k^{*}, k^{*}+1\), is

where \({\varvec{\beta }}_{*}^p = ({\varvec{\beta }}^{p'}_{k^{*}}, {\varvec{\beta }}^{p'}_{k^{*}+1})'\) and \({\varvec{\tau }}_{*}^{2} = (\tau ^{2}_{k^{*}}, \tau ^{2}_{k^{*}+1})'\). The proposal distribution is evaluated at current values of \({\varvec{\beta }}_{*}^{c} = (\beta ^{c'}_{k^{*}}, \beta ^{c'}_{k^{*}+1})'\). \(\beta _{*}^p\) is accepted with probability

where \({\varvec{x}}_{*}^{c} = ({\varvec{x}}^{c'}_{k^{*}},{\varvec{x}}^{c'}_{k^{*}+1})\). When the draw is accepted, update the partition and regression coefficients \((\xi ^{c}_{k^{*}}, \beta _{*}^{c}) = (\xi ^{p}_{k^{*}}, \beta _{*}^{p})\). Finally, draw \(\tau ^{2p}\) from

This is a Gibbs sampling step, and accordingly the draw is accepted with probability 1.

### Appendix D: Metropolis–Hastings algorithm

In this section, we describe the Metropolis–Hastings algorithm used in the stationary case for our simulation study (Sect. 4). As seen in Appendix C, the above RJMCMC reduces to a Metropolis–Hastings in the absence of the between-model moves.

We estimate the log of the spectral density by its posterior mean via a Bayesian approach and an adaptive MCMC algorithm:

Here, *M* is the number of post-burn-in iterations in the MCMC scheme; \(\mathbf {\theta }^{j}\) are samples taken from the posterior distribution \(p(\mathbf {\theta }|{\mathbf {y}})\); \(p(\mathbf {\theta })\) is taken from a Gaussian distribution \(N(\mu , \sigma ^{2})\) centred at \(\mu =\theta ^{[c]}\), the value that maximises the log marginal likelihood; \(\sigma \) is chosen arbitrarily; and \(\hat{{\mathbf {g}}}\) is the estimated log spectrum.

Monte Carlo algorithms have long been prominent for the estimation of hyperparameters and spectral densities in a nonparametric Bayesian framework. Metropolis et al. (1953) first proposed the Metropolis algorithm; this was generalised by Hastings (1970) in a more focused, statistical context. The random walk Metropolis–Hastings algorithm aims to sample from a target density \(\pi \), given some candidate transition probability function *q*(*x*, *y*). In our context, \(\pi \) represents the Whittle likelihood function multiplied by the respective priors. The acceptance ratio is:

Our MCMC scheme calculates the acceptance ratio every 50 iterations; based on an ideal acceptance ratio (Roberts and Rosenthal 2004), the step size is adjusted for each hyperparameter. The power spectrum and hyperparameters for the GP covariance function are calculated in each iteration of our sampling scheme and integrated over.
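The adaptive random-walk scheme just described can be sketched as follows. This is a simplified, one-dimensional illustration with a standard normal target: the multiplicative update rule and the constant \(\xi = 0.05\) are assumptions, one step size would be maintained per hyperparameter in practice, and the adaptation should diminish over time to preserve ergodicity.

```python
import numpy as np

def adapt_step(s, acc_rate, target=0.234, xi=0.05):
    """Hypothetical update: shrink the step when the trailing acceptance
    rate is below the target of 0.234, grow it when above."""
    return s * np.exp(-xi) if acc_rate < target else s * np.exp(xi)

def adaptive_rw_metropolis(log_target, x0, n_iter=10_000, s=1.0, seed=3):
    """Random-walk Metropolis with the step size adapted every 50 iterations."""
    rng = np.random.default_rng(seed)
    x, lp = x0, log_target(x0)
    accepts, samples = [], []
    for j in range(1, n_iter + 1):
        xp = x + s * rng.standard_normal()     # random walk proposal
        lpp = log_target(xp)
        if np.log(rng.uniform()) < lpp - lp:   # Metropolis accept/reject
            x, lp = xp, lpp
            accepts.append(1)
        else:
            accepts.append(0)
        samples.append(x)
        if j % 50 == 0:                        # trailing acceptance ratio
            s = adapt_step(s, np.mean(accepts[-50:]))
    return np.array(samples), s

samples, s_final = adaptive_rw_metropolis(lambda x: -0.5 * x * x, x0=0.0)
```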

First, we initialise the values of our GP hyperparameters, \(\theta ^{[c]}\), our log PSD \({\hat{g}}^{c}\), our random perturbation size \(s^{2}\), and the adaptive adjustment for our step size \(\xi \). Starting values for our GP hyperparameters are chosen based on maximising the marginal Whittle likelihood,

The latent log PSD is modelled with a zero-mean GP; that is, \({\mathbf {g}} \sim GP(0,k_{\theta }(x,x'))\), and we follow the notation of Rasmussen and Williams (2005), where \(k(x,x')\) refers to any respective kernel. The adaptive MCMC algorithm samples from the posterior distribution of the log PSD and the posterior distribution of any candidate covariance function’s hyperparameters. First, the current and proposed values for the mean and covariance of the Gaussian process are computed. That is, the mean is computed:

and the covariance is computed:

New proposals for GP hyperparameters are determined via a random walk proposal distribution. A zero-mean Gaussian distribution is used to generate candidate perturbations, where \(s^2\) is the variance of this Gaussian. That is

Having drawn the proposed GP hyperparameters, a proposed mean and covariance function of the log PSD are drawn from the posterior distribution of the GP. Both the proposed mean and covariance are computed similarly to the current values, simply replacing the values of the hyperparameters \(\theta ^{c} \xleftarrow []{} \theta ^{p}\). So, the proposed mean of the log PSD is,

and the proposed covariance is computed as follows

Having computed the proposed and current values of the log PSD, we update the current log PSD based on the Metropolis–Hastings transition kernel. First, we sample from a uniform distribution \(u \sim U(0,1)\) and compute our acceptance ratio,

\(p(\log I(\mathbf {\nu }) \mid \theta ^{p}, {\hat{g}}^{p})\) is our Whittle likelihood computation, the probability of the log periodogram conditional on hyperparameters \(\theta \) and our candidate estimate of the latent log PSD \({\hat{g}}\). \(p({\hat{g}}^{p})\) represents the prior distribution on our latent log PSD, and \(q({\hat{g}}^{c} \mid \log I(\mathbf {\nu }), \theta ^{p})\) is our proposal distribution, representing the probability of the estimated log PSD conditional on the log periodogram and GP hyperparameters.
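For concreteness, the Whittle likelihood term can be computed as in the sketch below, which assumes a unit sampling rate, excludes the zero and Nyquist frequencies, and omits additive constants; with \(g = \log S\), the log-likelihood is \(-\sum _{j} \left( g_j + e^{\log I_j - g_j} \right)\) up to a constant.

```python
import numpy as np

def log_periodogram(x):
    """Log periodogram at the Fourier frequencies nu_j = j/n, j = 1,...,n/2 - 1."""
    n = x.size
    I = np.abs(np.fft.rfft(x)) ** 2 / n
    return np.log(I[1:n // 2])  # drop the zero and Nyquist frequencies

def whittle_log_likelihood(log_I, g):
    """Whittle log-likelihood of the periodogram given the log PSD g = log S:
    -sum_j ( g_j + exp(log I_j - g_j) ), up to an additive constant."""
    return -np.sum(g + np.exp(log_I - g))

rng = np.random.default_rng(0)
log_I = log_periodogram(rng.standard_normal(512))
```

Each term is maximised pointwise at \(g_j = \log I_j\), so the likelihood rewards log PSDs close to the log periodogram while the GP prior supplies smoothness.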

Should \(u < \alpha \), we update the current values of the log PSD mean and spectrum to the proposed values. That is,

If \(u > \alpha \), both the mean and variance of the log PSD are kept at their current values,

Importantly, modelling the log PSD with a GP prior does not mean that we are assuming a Gaussian error distribution around the spectrum; under the Whittle approximation, the log periodogram errors instead follow a \(\log \mathrm {Exp}(1)\) distribution. In actuality, proposed spectra are accepted and rejected through a Metropolis–Hastings procedure, resulting in log PSD samples being drawn from the true posterior distribution of the log PSD. Having sampled the log PSD, we then accept/reject candidate GP hyperparameters with another Metropolis–Hastings step. Our acceptance ratio is,

where \(p(\log I(\mathbf {\nu }) \mid \theta )\) represents the Whittle likelihood modelling the probability of the log periodogram, \(\log I(\mathbf {\nu })\), conditional on hyperparameters \(\theta \). \(p(\theta ^{p} \mid \theta )\) is the prior distribution we place over GP hyperparameters, \(\theta \). Note that in this particular case, the symmetric proposal distributions cancel out and our algorithm reduces simply to a Metropolis ratio. Again, we follow the standard Metropolis–Hastings acceptance decision. If \(u < \alpha \),

and the current hyperparameter values assume the proposed values. Alternatively, if \(u \ge \alpha \),

the current values of the hyperparameters are not updated. Finally, following Roberts and Rosenthal (2004), we implement an adaptive step size within our random-walk proposal. Every 50 iterations within our simulation, we compute the trailing acceptance ratio and compare it against a target optimal acceptance ratio of \(\text {Acc}^{Opt} = 0.234\). If the acceptance ratio is too low, indicating that the step size may be too large, then the step size is systematically reduced. That is, if \(\text {Acceptance Ratio}_{(j-49):j} < \text {Acc}^{Opt}\) for \(j \in \{50,100,150,\ldots ,10000\}\),

If the acceptance ratio is too high, indicating that the step size may be too small, then the step size is systematically increased. That is, when \(\text {Acceptance Ratio}_{(j-49):j} > \text {Acc}^{Opt}\) for \(j \in \{50,100,\ldots ,10000\}\),

Finally, the log PSD and the respective analytic uncertainty bounds are determined by computing the median of the samples generated from the sampling procedure,

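The sampling scheme above can be sketched in Python. This is a minimal illustration on a generic one-dimensional target, not the paper's implementation: the function names, the retuning factor of 1.1, and the standard-normal random walk are illustrative assumptions; only the 50-iteration retuning window and the 0.234 target acceptance ratio come from the text.

```python
import numpy as np

def adapt_step_size(accepts, step, target=0.234, factor=1.1, window=50):
    """Retune the random-walk step from the trailing acceptance rate.

    Shrinks the step when acceptance falls below `target` (proposals too bold),
    and grows it otherwise. The factor 1.1 is an illustrative choice.
    """
    rate = np.mean(accepts[-window:])
    return step / factor if rate < target else step * factor

def random_walk_metropolis(log_post, x0, n_iter=10000, step=1.0, seed=0):
    """Generic adaptive random-walk Metropolis sampler (sketch)."""
    rng = np.random.default_rng(seed)
    x, lp = x0, log_post(x0)
    samples, accepts = [], []
    for j in range(1, n_iter + 1):
        xp = x + step * rng.standard_normal()
        lpp = log_post(xp)
        # Symmetric proposal: the acceptance ratio reduces to a Metropolis ratio.
        if np.log(rng.uniform()) < lpp - lp:
            x, lp = xp, lpp
            accepts.append(1)
        else:
            accepts.append(0)
        samples.append(x)
        if j % 50 == 0:  # every 50 iterations, retune toward 0.234
            step = adapt_step_size(accepts, step)
    return np.array(samples), step
```

Run against a standard normal log-density, the sampler recovers mean roughly 0 and standard deviation roughly 1 after burn-in, with the step size settling near the efficient regime.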
### Appendix E: Turning point algorithm

In this section, we provide more details on the identification of non-trivial peaks (local maxima). We aim to outline a broad and flexible framework for this purpose, in which the exact procedure may be altered according to the specific application. For example, one way to determine the peaks of a given spectral estimate is simply by inspection; we provide an algorithmic framework as an alternative.

Let \({\mathbf {g}}\) be an analytic or estimated log power spectral density function. We may optionally begin by applying additional smoothing to this function. Following James et al. (2022), we apply a two-step algorithm to the (possibly smoothed) function \({\mathbf {g}}\), defined on \(\nu _j=\frac{j}{n}, j=0,1,\ldots ,m-1\). The first step produces an alternating sequence of local minima (troughs) and local maxima (peaks), which may include some immaterial turning points. The second step refines this sequence according to chosen conditions and parameters. The most important conditions to initially identify a peak or trough, respectively, are the following:

where *l* is a parameter to be chosen. Defining peaks and troughs according to this definition alone has some flaws, such as the potential for two consecutive peaks.

Instead, we implement an inductive procedure to choose an alternating sequence of peaks and troughs. Suppose \(j_0\) is the last determined peak. We search over the indices \(j>j_0\) for the first of two cases: if we find a time \(j_1>j_0\) that satisfies (51) as well as a non-triviality condition \(g(j_1)<g(j_0)\), we add \(j_1\) to the set of troughs and proceed from there. If we find a time \(j_1>j_0\) that satisfies (50) and \(g(j_0)\ge g(j_1)\), we ignore this lower peak as redundant; if we find a time \(j_1>j_0\) that satisfies (50) and \(g(j_1) > g(j_0)\), we remove the peak \(j_0\), replace it with \(j_1\) and continue from \(j_1\). A similar process applies from a trough at \(j_0\).
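The first step of the algorithm can be sketched as follows, assuming conditions (50) and (51) take the common windowed form: index *j* is a candidate peak (trough) when *g(j)* is the maximum (minimum) of *g* over all indices within distance *l*. That windowed form and the function name are assumptions; the supersession and non-triviality rules follow the description above.

```python
def alternating_turning_points(g, l=3):
    """Produce an alternating sequence of troughs and peaks from a sampled curve g."""
    n = len(g)

    def window(j):
        return g[max(0, j - l):min(n, j + l + 1)]

    peaks, troughs = [], []
    last = None  # kind of the most recently committed turning point
    for j in range(n):
        if g[j] == max(window(j)):            # candidate peak, cf. condition (50)
            if last == 'peak':
                if g[j] > g[peaks[-1]]:
                    peaks[-1] = j             # higher peak supersedes the previous one
                # a lower consecutive peak is ignored as redundant
            else:
                peaks.append(j)
                last = 'peak'
        elif g[j] == min(window(j)):          # candidate trough, cf. condition (51)
            if last == 'trough':
                if g[j] < g[troughs[-1]]:
                    troughs[-1] = j           # deeper trough supersedes
            elif not peaks or g[j] < g[peaks[-1]]:   # non-triviality check
                troughs.append(j)
                last = 'trough'
    return peaks, troughs
```

For example, `alternating_turning_points([0, 1, 2, 1, 0, 1, 3, 1, 0], l=2)` returns the peaks `[2, 6]` and troughs `[0, 4, 8]`, an alternating sequence ready for the refinement step.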

As a side remark, for an analytic log PSD \({\mathbf {g}}\), we could simply use the analytical and differentiable form to find critical points as an alternative.

With either possibility, at this point, the function is assigned an alternating sequence of troughs and peaks. However, some turning points are immaterial and should be removed. Here, the framework can incorporate a flexible series of options to refine the set of peaks.

As mentioned in Sect. 3.2, one relatively simple option is simply to remove any local maximum (peak) \({\hat{\rho }}\) of \(\hat{{\mathbf {g}}}\) with \({\hat{g}}({\hat{\rho }})<\max \hat{{\mathbf {g}}} - \delta \), for some sensible constant \(\delta \). In our experiments, the same results are produced for any \(\delta \in [2,4]\), demonstrating the robustness of this relatively simple idea. Under an affine transformation of the original time series \(X'_t = aX_t + b\), the log PSD \({\hat{g}}\) changes by an additive constant, so this condition is unchanged when rescaling the original data.
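A minimal sketch of this threshold rule, with the function name and list representation being illustrative:

```python
def refine_peaks_threshold(g, peaks, delta=3.0):
    """Keep a peak p only if g[p] >= max(g) - delta.

    delta is on the log-PSD scale, so (as noted in the text) the rule is
    invariant under affine transformations of the original time series,
    which shift the log PSD by an additive constant.
    """
    top = max(g)
    return [p for p in peaks if g[p] >= top - delta]
```

For instance, with `g = [0, 5, 1, 8, 2]` and candidate peaks `[1, 3]`, `delta=3.0` retains both peaks while `delta=2.0` retains only the global maximum at index 3.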

This relatively simple condition is sufficient for our application. For the benefit of future work, we list some alternative options for an algorithmic refinement of the peaks (besides, of course, inspection as the simplest option). In previous work, we have analysed functions \(\nu (t)\) that were necessarily valued only in the non-negative reals. Thus, one may simply linearly translate the log PSD \({\mathbf {g}}\) so that its minimum value is zero. Then, numerous options exist for the refinement of non-trivial peaks (and troughs).

For example, let \(t_1<t_3\) be two peaks, necessarily separated by a trough. We select a parameter \(\delta =0.2\), and if the *peak ratio* \(\frac{\nu (t_3)}{\nu (t_1)}\) is less than \(\delta \), we remove the peak \(t_3\). That is, if the second peak has size less than a proportion \(\delta \) of the first peak, we remove it. If two consecutive troughs \(t_2,t_4\) remain, we remove \(t_2\) if \(\nu (t_2)>\nu (t_4)\), and otherwise remove \(t_4\). Alternatively, one may apply this peak ratio to any peak, comparing it to the global maximum rather than just the adjacent peak. That is, let \(t_0\) be the global maximum; then one could remove any peak \(t_1\) with \(\frac{\nu (t_1)}{\nu (t_0)} < \delta \).
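The two peak-ratio variants can be sketched as below, assuming peaks are held as an index list sorted by location; both function names are illustrative, and the adjacent-peak variant compares each peak against the most recently retained one.

```python
def refine_peak_ratio(nu, peaks, delta=0.2):
    """Adjacent-peak variant: drop a peak whose height is below delta
    times the height of the previously retained peak."""
    kept = []
    for p in peaks:
        if kept and nu[p] < delta * nu[kept[-1]]:
            continue  # second peak too small relative to the previous one
        kept.append(p)
    return kept

def refine_peak_ratio_global(nu, peaks, delta=0.2):
    """Global variant: drop any peak below delta times the global maximum."""
    t0 = max(peaks, key=lambda p: nu[p])
    return [p for p in peaks if nu[p] >= delta * nu[t0]]
```

With `nu = [0, 10, 0, 1.5, 0, 5, 0]` and peaks `[1, 3, 5]`, both variants discard the immaterial peak at index 3 (height 1.5 against a reference of 10) and retain `[1, 5]`.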

Alternatively, one may use appropriately defined gradient or log-gradient comparisons between points \(t_1<t_2\). For example, let

The numerator equals \(\log (\frac{\nu (t_2)}{\nu (t_1)})\), a “logarithmic rate of change”. Unlike the standard rate of change, given by \(\frac{\nu (t_2)}{\nu (t_1)} -1\), the logarithmic change takes values symmetrically in \((-\infty ,\infty )\). Let \(t_1,t_2\) be adjacent turning points (one a trough, one a peak). We choose a parameter \(\epsilon \); if

that is, the average logarithmic change is less than \(\epsilon \), we remove \(t_2\) from our sets of peaks and troughs. If \(t_2\) is not the final turning point, we also remove \(t_1\). After these refinement steps, we are left with an alternating sequence of non-trivial peaks and troughs. Finally, for this framework, we only need the peaks, so we simply discard the troughs.
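A sketch of this criterion follows. Since the displayed formula is not reproduced here, the denominator \(t_2 - t_1\) (averaging the logarithmic change over the separation) is an assumption consistent with the stated numerator, and the default \(\epsilon\) and function names are illustrative.

```python
import math

def log_change(nu, t1, t2):
    # Assumed form: numerator log(nu[t2]/nu[t1]), averaged over t2 - t1.
    return (math.log(nu[t2]) - math.log(nu[t1])) / (t2 - t1)

def remove_shallow_pairs(nu, turning_points, eps=0.5):
    """Drop adjacent turning points whose average logarithmic change is below eps.

    turning_points is the sorted alternating sequence of troughs and peaks.
    When |log_change| < eps for a pair (t1, t2), t2 is removed, and t1 as
    well unless t2 is the final turning point, as described in the text.
    """
    pts = list(turning_points)
    i = 0
    while i + 1 < len(pts):
        t1, t2 = pts[i], pts[i + 1]
        if abs(log_change(nu, t1, t2)) < eps:
            if i + 2 == len(pts):     # t2 is the final turning point
                pts.pop(i + 1)
            else:                     # otherwise remove both t1 and t2
                pts.pop(i + 1)
                pts.pop(i)
            i = max(i - 1, 0)         # re-examine the newly adjacent pair
        else:
            i += 1
    return pts
```

For example, with `nu = [1, 100, 90, 1000]` and turning points `[0, 1, 2, 3]`, the shallow pair (1, 2) is removed (its average logarithmic change is about 0.11), leaving the alternating sequence `[0, 3]`.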

As a final remark, only at the end is the final number *r* of non-trivial peaks determined. It is a function not only of the log PSD function \({\mathbf {g}}\), but also the precise conditions used to select and refine the (non-trivial) peaks.

## About this article

### Cite this article

James, N., Menzies, M. Optimally adaptive Bayesian spectral density estimation for stationary and nonstationary processes.
*Stat Comput* **32**, 45 (2022). https://doi.org/10.1007/s11222-022-10103-4
