Towards robust data-driven reduced-order modelling for turbulent flows: application to vortex-induced vibrations

This work presents a robust method that minimises the impact of user-selected parameter on the identification of generic models to study the coherent dynamics in turbulent flows. The objective is to gain insight into the flow dynamics from a data-driven reduced order model (ROM) that is developed from measurement data of the respective flow. For an efficient separation of the coherent dynamics, spectral proper orthogonal decomposition (SPOD) is used, projecting the flow field onto a low-dimensional subspace, so that the dominating dynamics can be represented with a minimal number of modes. A function library is defined using polynomial combinations of the temporal modal coefficients to describe the flow dynamics with a system of nonlinear ordinary differential equations. The most important library functions are identified in a two-stage cross-validation procedure (conservative and restrictive sparsification) and combined in the final model. In the first stage, the process uses a simple approximation of the derivative to match the model with the data. This stage delivers a reduced set of possible library function candidates for the model. In the second, more complex stage, the model of the entire flow is integrated over a short time and compared with the progression of the measured data. This restrictive stage allows a robust identification of nonlinearities and modal interactions in the data and their representation in the model. The method is demonstrated using data from particle image velocimetry (PIV) measurements of a circular cylinder undergoing vortex-induced vibration (VIV) at \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathrm{Re}=4000$$\end{document}Re=4000. It delivers a reduced order model that reproduces the average dynamics of the flow and reveals the interaction of coexisting flow dynamics by the model structure.


Introduction
In many engineering applications, the prediction and control of highly energetic coherent flow structures are important in designing, monitoring and optimising fluid systems. Most flows of practical interest are turbulent, and many strategies aim to gain insight into the flow dynamics by approximating only the coherent dynamics of the system with a reduced-order model (ROM). Over the last decades, refinements in experimental and numerical simulation methods have enabled access to very large amounts of high-fidelity data, making data-driven approaches an increasingly interesting choice for deriving ROMs [10]. However, the stochastic with the Navier-Stokes equations (N ), the basis of the problem is changed, using a modal decomposition to approximate the instantaneous velocity field with only a few basis functions. Then, the evolution of the temporal coefficients a i (t) is represented with a nonlinear operator F. In a final step, the approximated field is projected back to the canonical basis.
The general process can be divided into two main steps: first, reducing the dimensionality of the input data by a modal decomposition and, second, building robust ROM from data to approximate the nonlinear operator F (sparsification procedure). A flow chart summarizing the steps of the two sparsification algorithms is shown in Fig. 2.

Flow field decomposition and mode coupling
In this section, the essential steps of the data decomposition (SPOD) and a method to find coupled modes are presented following Sieber et al. [35]. Data, either measured (e. g. PIV) or computed (CFD), are recorded as snapshots (M spatial points over N time steps).
For PIV and CFD data, usually M ≥ N and the initial decomposition steps are more suitably carried out with the snapshot POD [37]. The POD projects the data onto an orthogonal base that is optimal with respect to the representation of kinetic energy of fluctuations, so that the most energetic structures can be represented by a minimal number of basis functions. As a first step, the velocity field v(x, t) is decomposed into a mean velocity field v(x) and fluctuations v (x, t). For the VIV data, the flow shows a spatial symmetry. Following Holmes et al. [10], this property is exploited and the decomposition is performed for a symmetrised flow field using the procedure described in "Appendix A". By reinforcing natural symmetries of the modes, the convergence rates are increased with respect to the number of employed snapshots [3]. Note that this procedure also optimises the spatial convergence of the later applied filter operation of the SPOD.
Thus, as shown in Fig. 1, a decomposition of v(x, t) in the form is sought, where i (x) are the spatial modes and b i (t) the corresponding temporal coefficients. The modal basis is constructed from correlated observations of fluctuations in the flow. Therefore, the correlation between individual snapshots is computed using the L 2 inner product: where specifies the spatial region (PIV window) of integration. The elements of the covariance matrix C represent the covariance over all snapshots and are given by At this point, the only additional step of SPOD is applied. Instead of directly deriving the temporal coefficients with an eigendecomposition of C, an additional filter operation is applied to the covariance matrix, which requires the selection of a filter. The filter is specified by the size of a discrete filter in terms of sampling points N f , which is equivalent to a short time span N f Δt given the sampling interval of the signal Δt. The elements of the resulting filter covariance matrix W i j are given by Here, a Gaussian filter is applied, where the vector elements g k are: There are three options to deal with boundary points: (i) Periodic boundary conditions are imposed, if the periodic property applies to the investigated flow. (ii) Invalid boundary points are removed. If the filter operation relies on samples outside the acquired time frame, the analysis window is reduced (the option applied to the VIV data). This requires much smaller filter sizes than the number of snapshots to retain sufficient data after the truncation. (iii) Adding zeros to the end of the signal (zero padding). The filter operation allows a continuous transition of the decomposition between the POD and the discrete Fourier transform (DFT). Depending on the filter size, the temporal coefficients of the SPOD modes are better behaved than POD coefficients, i. e. the time evolution is smoother and with reduced spectral bandwidth. Thus, these coefficients are more suitable for deriving a ROM, especially for turbulent flows. The filter size is the only additional parameter compared to POD and should be chosen with respect to a characteristic time scale of the flow. For the VIV measurement, the selected scale corresponds to two periods of the cylinder oscillation. With the eigendecomposition of the filtered covariance matrix, the temporal coefficients b i (t) and the corresponding eigenvalues μ i (representing twice the average kinetic energy of each mode) can be obtained Since W is a real symmetric positive-definite matrix, its eigenvectors b i (t) are orthogonal and scaled with the mode energy From the projection of the snapshots onto the temporal coefficients, the spatial modes can be derived The identification of mode pairs, typically having a similar amount of kinetic energy, but shifted in phase, is essential to represent the dynamics of coherent structures. Most approaches for identification involve an inspection of the spatial modes shapes and Lissajous figures, which becomes more difficult for complex dynamics. Sieber et al. [35] introduce an unbiased method to quantify coupled dynamics of modes, using DMD. The idea is to find temporal coefficients with the same spectral content, considering a π/2 phase lag to calculate a spectral proximity measure. Therefore, it is assumed that the temporal evolution of the mode coefficients is captured by a linear operator T: With two snapshot matrices V 1 and V 2 defined as follows the operator T can be described by where V + 1 is the Moore-Penrose pseudo-inverse of V 1 . Note, that the explicit computation of V + 1 can be problematic for ill-condition snapshot matrices and Eq. (13) is only used for illustration. The problem can also be described and solved by a least-square regression. Hence, the latter option is used in the implementation of the procedure using the MATLAB backslash operator. The DMD modes can then be derived by an eigendecomposition of the operator T: The modal representation of b(t) can be written as where h i j is the spectral content of the single mode coefficient b j with respect to the eigenvalue ν i . Note, that this notation is only exact for N c = N ; however, for the described process N c < N . To mitigate measurement noise in the identification procedure, only modes with an acceptable signal-to-noise ratio should be considered for the calculation of T. Modes with very low energy are associated with a low signal-to-noise ratio. To remove these modes, the number of retained modes (sorted by energy in descending order) is truncated after N c modes. This number is calculated based on the energy resolved by the SPOD: For the VIV measurement, the modes are truncated around E(N c ) = 0.95. The amplification rates σ i and frequencies ω i of the operator T are related to eigenvalue ν i as follows: where i := √ −1. The measure of the proximity for ranking the spectral similarity of temporal coefficients b(t) is given by The elements of the skew-symmetric matrix H are sorted by decreasing magnitude. The first SPOD mode corresponds to the largest element H mn identifying the coupled modes m and n. The next lower magnitude of H, excluding the row and column of the first element, defines the second coupled SPOD modes, and the process is repeated until all modes are paired. The coupled modes b m and b n form the real and imaginary parts of a complex SPOD mode a i (t) with an amplitude A i (t) and a phase ϕ i (t) The subscript i corresponds to the sorting order. The spatial modes are paired analogously: The entire time-dependence is captured by the state vector a(t). Using the complex definition of the modes, the reconstruction equivalent to Eq. (1) now reads: Thus, the temporal evolution of the flow can be represented by deriving a ROM for the nonlinear operator F to describe the deterministic dynamics of the flow: d dt a(t) = F(a(t)).

Sparsification
While the decomposition of the velocity fluctuations into SPOD modes in Eq. (1) is exact (when all modes are used), it is desired to select a function basis using the least number of modes N m , while describing the dominant coherent dynamics sufficiently well.

Mode selection
The mode selection depends on the flow configuration, and the following points need consideration. The SPOD procedure already ranks modes in two ways. The kinetic energy ranking Eq. (6) considers the larger energetic contributions, and the harmonic correlation ranking Eq. (18) provides insight into the possible dynamic significance of the lower energy interactions. The highest ranked modes of both sortings should be selected. Note, that the highest ranked modes in both cases do not have to differ remarkably. The final model prediction can also provide insights, if sufficient modes have been selected to cover the most important interactions. For the VIV data, the first three SPOD modes with the highest energy are also those with the highest harmonic correlation. These represent the vortex shedding synchronised with the cylinder oscillation, the natural vortex shedding, and their dominating nonlinear interaction.

Computation of the derivative
To describe the deterministic dynamics, the derivative of the selected modes provides the basis for the twostage sparsification. In order to accurately estimate the numerical derivative from data, while avoiding error amplification due to measurement noise, a sixth-order compact scheme [18] was chosen. For each temporal coefficient, the unknown derivativeȧ(t) can be obtained by solving the tridiagonal linear system: Here, Δt is the time step between two snapshots and the numerical coefficients for all inner points j ∈ 3, .., N − 2 of the derivative stencil are: The definition at the boundary is shown in "Appendix C". Both the state vector and its derivative can be arranged in two matrices: . . .

Function library
It is assumed that the nonlinear operator in Eq. (22) can be described by a linear combination of a set of library functions, which are polynomial functions of the modal coefficients. For the present investigation of VIV data, polynomial combination of the coupled mode coefficients of A and its complex conjugate A up to the cubic order are used to build the library (A), which corresponds to the highest order of terms that would arise from a POD-based Galerkin projection of the Navier-Stokes equations [24]: Here, A P 2 , A P 3 , etc., denote higher order polynomials, where A P 2 denotes quadratic nonlinearities of the state vector a(t) and (AA P 2 ) denotes quadratic nonlinearities of the state vector a(t) and its complex conjugated a * (t): Generally, the library may be composed of all possible forms of nonlinear functions and polynomial combinations of the state vector to account for nonlinear dynamics. In the next step, linear-dependent library functions are removed from the library, using a Q-R-decomposition.

Sparsification procedure
Assuming that only a few of the library functions are required for the representation of F, a sparse regression problem can be formulated. The idea is to determine the important library coefficients, i. e. finding the sparse vectors of library coefficients = [ξ 1 ξ 2 . . . ξ N m ] for each mode. Hence, Equation ( 22) can be formulated asȦ Nonzero entries (active coefficients) of ξ k define which of the library functions are contributing to compute mode coefficient a k (t). Initially, all coefficients are considered possible candidates. The sparsification procedure seeks to identify the important, i. e. statistically significant library coefficients and thereby reduces the model complexity. This avoids the risk of overfitting and increases the robustness of the ROM, while maintaining its accuracy. The procedure consists of two cross-validation stages, which allow to evaluate the statistical importance of individual coefficients and minimise computational costs. A simplified conceptual schematic of the process for both stages is depicted in Fig. 3.
The first stage (conservative sparsification) focuses on the approximation of the derivative and the second stage (restrictive sparsification) on the model prediction. In both stages, the selected library coefficients are trained with a subset of the data based on the optimal approximation of the derivative and validated against the remaining data. For this, the measurement data are split into continuous data segments of length N s . Here, the length is based on the SPOD filtering operation (N s = N f ). Then, these segments are N train times randomly assigned to training a T t train and validation a T t val data. Based on exploratory trials using different distribution ratios, results using 20% data segments for training and 80% for validation yielded consistently robust models. The ratio should be chosen with respect to the amount of available data maximising the validation data while keeping enough data for training to reduce the risk of overfitting. For the VIV data, N train = 100 showed good convergence for both stages with respect to the final selection of the library coefficient.

Conservative sparsification
The calculation procedure for the conservative sparsification is presented in Algorithm 1. The principle of this cross-validation step is a bottom-up approach of a greedy algorithm to approximate the derivative for each mode, i. e. finding the optimal choice of a library coefficient in each iteration to improve the approximation of the derivative. Initially, all library coefficients are set passive, i. e. set to zero and excluded from training, and the individual coefficients are gradually activated, trained and evaluated. The training of the library coefficients for the kth individual mode for the ith training data set is computed as follows: Here, the operator • stands for an elementwise product and d i k is a Boolean vector controlling which coefficients are active and passive. Note, that this notation is chosen to indicate dependencies of the variables and to be in line with Algorithm 1, but the equations are correct even without the index for the training set variation. With the trained coefficients, the approximation residual e jk for the jth library function of the kth mode of the derivative can be derived: When all passive coefficients are individually tested, the library coefficient that shows the highest improvement of the residual in an iteration is kept (permanently) activated. This procedure is repeated until further activation of coefficients no longer result in improvements with respect to the residuals. This results in a set of possible candidates of library coefficients for each random training and validation data set, which are merged into a final list of possible candidates.

Restrictive sparsification
The calculation procedure of the restrictive sparsification is given in Algorithm 2. This procedure is a topdown approach, that starts with the candidate list of the conservative sparsification. In contrast to the first stage, the idea is to optimise the coefficient selection based on the model prediction on a system level and not for each mode individually. In an iterative procedure, models are built excluding one of the library coefficients candidates, trained and then, a short time integration is performed. As a reference, an additional integration with all candidates included is computed. The training of the library coefficients follows the same procedure as in the conservative sparsification algorithm, but for the entire system simultaneously: where D is a Boolean matrix that contains the information of currently active and passive library coefficients. Data points at the beginning of every non-training segment are used as initial values. This ensures an equal distribution and weighting of measurement data used as initial values, i. e. avoiding initial values taken from consecutive data points. As the equal-length segmentation is used on a periodic flow, the start of the segment is randomly shifted within the length of one oscillation period to prevent an over-weighting of a specific phase.
To determine the deviation of the model prediction from the true state, the following initial value problem is N p times solved (for the nth initial value t 0 ) and the residual is computed: The number of simulated time steps N t is based on two periods of the cylinder oscillation. A solver based on an explicit Runge-Kutta of fourth-and fifth-order (Dormand-Prince method) is used to integrate the model. If the integration is aborted (unstable model), the residual is penalised with r 0k . Simulations showed that this mainly occurs when the most important coefficients are excluded. With difference of the residuals (active and passive coefficient), a quality measure Q for each coefficient candidate can be computed (see Algorithm 2). The quality measure allows to determine, if a coefficient is statistically significant. Hence, all coefficients that are not statistically significant (negative values of Q) are removed. With the new list (sparse matrix D) of candidates, the procedure is repeated, until the list of candidates remains unchanged. In a final step, the resulting library coefficients (active coefficients in matrix final ) are trained using the entire measurement data as training data. Then, with a simulation with resulting ROM of the temporal coefficients a sim,T (t) , the flow field can be reconstructed.

Application to experimental data
In this section, the presented method is applied to experimental data of the VIV of a circular cylinder. The datasets analysed during the current study are available from the authors on reasonable request.

Experimental setup
The experiments were carried out at the University of Calgary in a free-surface water channel facility. Details of the experimental setup can be found in "Appendix D" and Riches et al. [26]. This work focuses on the desynchronisation regime, as it involves complex dynamical interactions of both, the motion induced by the cylinder oscillation and the vortex shedding. In this regime, the spectral signature of the velocity fluctuations and the independently measured lift force show peaks at both the shedding frequency f sh and cylinder vibration frequency, f c allowing a good demonstration of the method. The experiments were conducted for a reduced velocity of U * = 8.75. The system has one degree of freedom with the cylinder moving transversely to the flow.    Figure 6 illustrates the SPOD modes, which consist of coupled mode pairs using DMD as described in Sect. 2.1. Shown are the v-component of the spatial (paired) mode, the evolution of the corresponding temporal coefficients and the spectrum of (a i ), which is representative for the coupled mode as the spectrum of (a i ) is almost identical. The first mode in Fig. 6 corresponds to the cylinder oscillation, as f c matches the peak frequency of the independently measured lift force. The second mode is associated with the vortex shedding process, noting that its spectral energy is concentrated at the vortex shedding frequency f sh and the distribution of extrema in its spatial component matches the number of vortices depicted in Fig. 4. Similar to the measure power spectrum in Fig. 5, the peak of the vortex shedding frequency shows a broader bandwidth. This indicates some sort of variation or disturbance in the vortex shedding process. Generally, the envelope of the oscillations of the first mode tends to be inversely related to the amplitude of oscillations of the second SPOD mode suggesting an energy exchange, or interference, between these modes. The frequency of the third mode corresponds to the difference of the vortex shedding and cylinder oscillation frequencies ( f t = f sh − f c from Fig. 6). Hence, the slow varying third mode is considered to represent an interaction between the first and second SPOD mode. This interaction is expected to increase the cycle-to-cycle variation in the vortex shedding frequency and lead to a broader frequency peak in the power spectra around the shedding frequency as can be observed in the measurement (Fig. 5). The spatial component of this mode is, in contrast to the previous two, asymmetrical. This mode can account for an asymmetry shift in the flow field observed in Fig. 4. The fourth SPOD mode corresponds to the second harmonic, 2 f c , of the cylinder oscillation. Its energetic content is relatively low and the amplitude of the oscillation, or envelope, appears directly related to that of the first mode. For the SPOD, a filter length N f = 32 was used, which corresponds to two periods of the cylinder oscillation. With this choice, a good separation of the dynamics is already visible, as the mode spectra show very distinct and narrow peaks, while representing a significant amount of total fluctuation energy. Thus, the temporal coefficients are better behaved than the coefficients of a classical POD and therefore more suitable for the construction of a ROM. The influence of the SPOD filter lengths is discussed in "Appendix B". Within the scope of illustrating the methodology, only the three most energetic coupled modes are selected to derive the ROM, as this is the minimal group of modes for a meaningful representation of the process and they already describe a complex interaction. As indicated with the fourth mode, additional coherent motions are captured in modes with significantly less energy content (low impact on the dominating dynamics), so that with the chosen degree of complexity these modes are not represented by the model.

Final reduced-order model
The sparsification procedure starts with building a polynomial function library of cubic order, as illustrated in Eq. (27) using the three most energetic modes depicted in Fig. 6. The conservative and restrictive sparsification algorithms (Algorithms 1, 2) are applied to retain only the statistically significant library coefficients and corresponding functions. With this step, the number of possible library coefficients is reduced from 252 to only 15. In a final step, the model is calibrated against all available data. The resulting selection is presented in Table 1 and the corresponding system of ODEs in Eq. (38). Table 1 shows the coefficients, the calibrated values, the corresponding library functions and the system residual change Q sys of the coefficient. The system residual change represents the statistical relevance of an individual coefficient for the system description. The determination of Q sys follows the idea of the restrictive sparsification procedure, i. e. evaluating the relevance of the coefficients by comparing the model performance including and excluding the coefficient. For this, the measurement data are split into N segs equal-length segments for which the model is simulated. To avoid an over weighting of a specific cylinder oscillation phases, the initial values of the integration are shifted randomly within a period. The system residual change for each coefficient is determined as: where R all is the residual of the model using all selected coefficients, R coeff is the residual excluding the analysed coefficient and R 0 is the variance of the measured signal. With the coefficients in Table 1, the final system of ODEs then reads: da 1 dt = α 1 · a 1 + α 2 · a 2 |a 2 | 2 + α 3 · a 1 |a 1 | 2 + α 4 · a 1 |a 2 | 2 + α 5 · a 2 a 3 a * 1 + α 6 · a 1 a 1 a * 3 (38a) da 2 dt = β 1 · a 2 + β 2 · a 2 a 2 a * 1 + β 3 · a 2 a 3 + β 4 · a 1 a 1 a * 3 + β 5 · a 1 |a 3 | 2 (38b) The usage of specific library functions in combinations with the library coefficients provides insights into the flow physics. Here, the final selection (Table 1) can be categorised into different groups. The first and most important group with respect to the impact on the system are linear terms, which describe in this context a complex oscillator. The pairing of two real-valued coefficients to one complex-valued coefficient is introduced to link modes in the model, which are physically connected. On the level of the model ODE, this drastically simplifies how the oscillatory mode pair is represented. The description of linear oscillator by a pair of real-valued coefficients reads as follows [21,24]: where a r and a i constitute the real and imaginary part of the coupled mode-pair a = a r +ia i . The corresponding linear amplification rate σ and oscillation frequency ω can be similarly composed to a complex coefficient α = σ + iω. This results in the simple representation of the complex oscillator as Thus, for the linear terms, the imaginary part of the library coefficient determines the frequency of the corresponding mode. The frequencies of the modes derived by the calibrated library coefficients presented in Table 1 deviate less than 3% from the peak frequencies of the spectra in Fig. 6. Similar to a nonlinear oscillating system described by the Stuart-Landau equation [16] and the generalised mean-field model [21,24], the ROM includes nonlinear terms, where the mode a j is coupled with the amplitude but not the phase of a mode a k : Within the model, these terms address nonlinear coupling that results in changes of the linear terms described above. Thus, they influence the energy transfer between the modes, which can lead to a saturating effect stabilising the system. From a perspective of the flow dynamics, these terms can be understood as a mean field correction [24]. Further investigations of the coefficients show, for example for mode a 3 of Eq. (38c), a three mode interaction of the two complex coefficients a 1 = A 1 e iω 1 t and a 2 = A 2 e iω 2 t , which results in the observed frequency mixing. This is clearly seen from the complex-valued multiplication represented by coefficient γ 2 of the model. It represents a forcing of mode a 3 at the difference of the individual frequencies, which corresponds exactly to the observed composition of mode frequencies in Fig. 6c, where the difference of forced shedding (a 1 ) at f c = 0.128 and natural shedding (a 2 ) at f sh = 0.203 leads to the interaction mode frequency of f t ≈ 0.203 − 0.128 ≈ 0.08. The modelling process identifies mode a 3 with a negative real part in the linear coefficient γ 1 . Therefore, this interaction mode is actually damped and only observed due to the forcing by the interaction of cylinder oscillation a 1 and the natural vortex shedding a 2 .
To provide another example for the interpretation of the model coefficients, β 4 (a 1 a 1 a * 3 ) is discussed. It describes a triadic interaction of the second harmonic of the cylinder oscillation a 1 a 1 with the interaction mode a 3 acting on the vortex shedding mode a 2 . It corresponds to a frequency of 2 × 0.128 − 0.08 = 0.176, which will be further discussed with the dynamics of the model presented in the following section. Note that the physics discussed only represents correlations in the data and most likely, but not necessarily, represents actual fluid dynamic interactions.
In a comparable way, other terms can be analysed and their contributions assessed, which helps to conceptualise internal mechanisms and interaction of the underlying physics. There are two aspects that need to be considered when assessing the performance of the ROM. First, the reduced number of modes used for the model only represents around 60% of the turbulent kinetic energy (TKE) of the flow. Thus, only the most energetic and dominating flow structures are represented by the model and other coherent, and low-energy structures are not captured. Secondly, the model does not account for any random perturbation. Hence, the stochastic characteristics of the turbulence make an accurate long-term prediction with a deterministic model impossible. The deviations from constant amplitude (limit-cycle) oscillations are similarly observed for an isolated global instability in a turbulent flow [17,36]. There, the deviations from the limit-cycle are explained by turbulent perturbations and related to the turbulence intensity.
As a consequence, simulations of the final ROM often start to deviate from the actual temporal coefficients after a short period. In Fig. 7, observed and predicted evaluations of the real part of the temporal coefficients of the mode pairs are compared for representative time-series excerpts. Note, that the second part of the mode pair (a i ) shows the same behaviour. Generally, the higher-energy modes are better approximated over longer time intervals. For the lower-energy modes, for example a 3 in Fig. 7c, this deviation can be observed after two periods of the cylinder oscillation.
A more general statement of the performance can be obtained from the averaged simulation residual (for N init = 300 randomly chosen initial values): as displayed in Fig. 8. The gradient of the residual provides information on how fast the predictions deviate and to which extent the measurements are governed by stochastic influences. The convergence of the model to a certain error level indicates a general stability, as there is no unlimited growth independently of the initial values.  Figure 9 depicts samples of the observed coefficients overlaid with the concatenated segments. With a variation of the simulation duration, i. e. increasing the length and reducing the number of segments, one can evaluate how well the deterministic model performs compared to the measured statistics. Figure 10 displays the variance  Fig. 9. The definition of variance is given by: where E (X) is the expected value of the signal and Var (X) its variance. It is a representation of the average mode energy. Hence, the deviation of the predicted from the actual (measured) variance provides insight of the model energy balance. For short simulations (tU ∞ /D ≤ 50), Fig. 10 shows a very good agreement of the actual and simulated variance. This is in line with the expectation, since for these time intervals the influence of stochastic perturbations is considerably lower. However, for longer simulations the prediction of the energy balance within the model differs from that of experimental data. Here, the downward drift of the first mode Fig. 10a and the upward drift of the second mode Fig. 10b indicate that in the model more energy is transferred from the first to the second mode than observed in the data. For each segment of the concatenated coefficients, i.e. for each simulation of the temporal coefficient, a fast Fourier transform (FFT) of (a i ) is computed. The average of these spectra (for tU ∞ /D ≈ 240) is shown in Fig. 11. These spectra are representative for the mode pair and correspond to relatively long simulations, i. e. long segments (30% of the realisation data) and show that the model still captures the general, spectral characteristics of the flow. The difference between the predicted and the measured spectra can be attributed to several factors. The deviation is most obvious in the decay of the flanks next to the peak. The model only explains a narrow spectral contribution at the peak frequency of each mode, whereas the observed (measured) mode coefficients include further dynamics. The previously illustrated triadic interaction represented by the coefficient β 4 of the model describes a forcing of the vortex shedding mode at a frequency of 0.176. This causes the hump to the left of the main peak of the simulated spectrum in Fig. 11b. However, such nonlinear mechanisms explain only part of the measured (a) (b) (c) Fig. 11 Power spectrum of the mode coefficient (a i ) obtained from the SPOD of PIV measurement (blue), a simulation of the calibrated low-order model (orange); modes related to a cylinder oscillation, b the vortex shedding c their interaction spectral broadening. Since the model only includes the first three SPOD mode pairs, which account for about 60% of the turbulent kinetic energy, the difference could be due to the neglected effect of higher modes on the leading modes. The model thus captures the interactions between the large amplitude motions, but does not explicitly account for lower energy perturbations. The discarded modes, representing further coherent dynamics or stochastic fluctuations, might thus be necessary to model the observed broader spectral peaks. Instead of adding further modes, the unresolved perturbations can be modelled by a generic stochastic forcing, which effectively emulates the observed broadening of the spectral peaks [25]. This approach does not improve the predictive capabilities of the model since it only adds random perturbations. However, it allows better reproduction of the average statistics [36]. In this context, it should be noted that, although the perturbations can be considered as additive random forcing for modelling purposes, in the experimental data they are not filtered out by the decomposition. The data were recorded from the entire system including the disturbances. Therefore, the unresolved disturbances are implicitly included as frequency fluctuations of the resolved data. While there could also be a broadening of the spectra due to measurement noise, this effect is expected to be secondary, as such contributions are generally poorly correlated in space and time and are filtered out by the modal decomposition [32]. Regardless of the reason for the discrepancy, obtaining directly a more accurate model would be at the cost of a significant increase in its complexity. The potential benefit of using just a few additional modes is usually small. Since the remaining modes contain much less energy, substantial differences in the model usually require many more modes. Furthermore, this mainly translates the problem of the neglected action of unresolved modes on resolved modes towards higher modes. The alternative representation by generic stochastic forcing requires further modelling of the forcing characteristics, which is not straightforward [2,6]. The choice of the right model complexity ultimately depends on the intended purpose of the ROM, balancing captured dynamics and predictive accuracy against complexity.
With the simulated temporal coefficient and the corresponding spatial modes, a simulated velocity field can be reconstructed using Eq. (21). This allows to estimate the Reynolds stresses in the cylinder wake and gain some insight of predicted flow physics. Figure 12 displays the shear Reynolds stresses at the cross section with the maximum measured Reynolds stresses for: the measured field, a reconstructed field using the first three SPOD modes and the stresses reconstructed with the simulated temporal coefficients. The estimation of the shear stresses provides a measure to evaluate if the model can represent the influence of the large flow structures. Using the measured temporal coefficients to reconstruct the flow field provides a reference for comparison.
As both reconstructed Reynolds stresses account only for around ∼ 60% of the TKE, the extrema of these stresses are expected to be lower than the stresses derived from the measurement. However, the simulated stresses are already very close to the stresses described with the three SPOD modes. Slightly reducing the simulation length leads to an excellent match of the reconstructed stresses. Thus, with respect to the represented energy content, the deterministic ROM is capable of accurately reproducing the statistics associated with the flow physics. Further, for this flow configuration the influence of the stochastic perturbations is limited to the extend that the dominating dynamics can be captured with a few modes and represented by the model.

Concluding remarks
A robust method has been presented for deriving reduced order models from data to describe the underlying flow dynamics. The methodology is demonstrated for the turbulent wake of a cylinder undergoing vortex-induced vibration. Its robustness is achieved by establishing objective criteria in the decision process underpinning a two-step sparsification procedure. This approach aims to minimise user interaction and ensure an unambiguous procedure.
The separation of the flow dynamics is realised by a modal decomposition of the flow data. A reduced set of spatial modes represents dominating flow structures and their dynamics through their corresponding temporal coefficients. The evolution of these coefficients allows a simplified description of the dynamics. A system of ODEs is constructed using linear and nonlinear combination (here polynomial) of the coefficients to model the dynamics. These combinations define a library of possible functions. The critical step in the presented method is the optimised selection (sparsification) process to identify the minimal number of relevant library functions and corresponding coefficients implementing a two-stage cross-validation procedure. The first stage (conservative sparsification) works as a prefiltering of the library coefficients, where all of these coefficients are step-wise tested and relevant coefficients with respect to the approximation of the derivative are kept. In the second stage (restrictive sparsification), an iterative process is used to quantify the relevance of each library coefficient for the system and remove non-physical coefficients. Briefly, these coefficients are used to build models, validated them against the measured evolution of the temporal coefficients and perform short-time predictions. The final selection of library coefficients follows a statistical significance test by the successive exclusion of coefficients.
The major improvement gained through the sparsification procedure addresses three challenges in particular. Firstly, in contrast to previous approaches, it not only identifies coefficients that influence the approximation of the derivative, but also determines how sensitive the prediction is to a given coefficient. This enhances the reliability and robustness of the final ROM. Secondly, introducing cross-validation in both stages of the sparsification reduces the risk of overfitting. Within the conservative sparsification, this is accomplished by using randomly chosen segments of the computed derivative to train the coefficients and the non-training segments for validation. For the second stage, the training is performed in a similar way, but the validation is based on the prediction of the temporal evolution of the non-training segments and not its derivative. Using only 20% of the data for training and 80% for the validation further minimises the risk of overfitting. Finally, a central goal of the new sparsification algorithms is the definition of an objective criterion for the termination of the sparsification procedure. Thus, the system identified is not sensitive to user-selected thresholds and the complexity of the model is determined by the data.
The choice of decomposition technique is not considered essential to the process. However, the proprieties of the decomposition do influence the general performance of the ROM. Here, SPOD was chosen because it enables an efficient separation of the dynamics and leads to well-behaved temporal coefficients. This is beneficial for the construction of ROM and increases the probability, that the process results in an accurate and robust representation of dominating flow dynamics. Another optional step is the identification and coupling of modes pairs with a DMD and their representation as a complex mode. By minimising the number of possible library functions and forcing the process to handle modes as paired couples, the computational costs in the sparsification procedure are reduced and the description of the system simplified. The procedure can be applied using real valued modes and without prior coupling.
The number of selected modes is a balance between model complexity and the dynamics resolved by the model. Note, this applies to all approaches based on modal decomposition. Additional models using a larger number of modes have also been computed. The resulting models are more complex but benefit the representation by incorporating interactions between the higher energy modes and the added modes. However, it remains that the more complex models do not change the basic form and relationships between the modes presented in this work. Consequently, the simpler model is found to be more suitable for an initial demonstration of the method.
The results from the presented method show its capability to derive a robust ROM solely from measurement data. It can capture the deterministic part of the underlying dynamics of a turbulent wake. The only exogenous parameters that need to be defined within the process can either be derived from dynamical characteristics (e.g. vortex shedding frequency) of the flow or address procedural adjustments such as the number of repeated iterations on different parts of the data. Moreover, the process design exhibits a low sensitivity to control parameter variations. Hence, the process converges towards a final model, i. e. robustly identifying the same basis functions and library coefficients repeatably. Additionally, the analytic structure of the utilised library functions resembles the terms of a Galerkin projection of the Navier-Stokes equations. Thus, it enables to gain insight of the model mechanics and interactions by analogy. Since specific terms of the ROM can be related to fluid mechanical phenomena, the model also allows an interpretation of flow physics. Due to its fast prediction capability, the model may potentially be suitable for applications where rapid approximations of the flow evolution are critical, such as for active flow control, process monitoring and assessing design performance Funding Open Access funding enabled and organized by Projekt DEAL.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A Symmetrisation
In case the investigated flow field shows a symmetry, it is recommended to make use of symmetrical properties. This enables an improved separation of the dynamics with a modal decomposition. The wake of a circular cylinder undergoing VIV showed a symmetry with respect to the y = 0-plane. Thus, before performing a modal decomposition, the fluctuation is decomposed into a symmetric v (s) (x, t) and an asymmetric v (a) The symmetric and asymmetric split is computed as follows: With both, POD and SPOD, the symmetric and asymmetric fluctuations can be described by a sum of spatial modes i (x) and corresponding temporal coefficients b i (t): With Eqs. (46), ( 47) and ( 49), the symmetric and asymmetric part can be written in a combined sum: Then, Eq. (49) can be written as a combined sum: where the modes are ordered by energy. From here on, the procedure continues as described in Sect. 2.

B SPOD filter
SPOD results of the symmetrised VIV data for two filter lengths (N f = 0, 32) are illustrated in Fig. 13. A limiting case, computed with the minimal filter length (N f = 0), is equivalent to results of a classical POD. The case using N f = 32 corresponds to the results discussed in Sect. 3. This filter length approximately corresponds to a characteristic time scale of the flow (two periods of the cylinder oscillation, N f · Δt = 1.33s, T p = 1/ f c ≈ 0.67s). For each case, so-called SPOD spectra [35] are depicted in Fig. 13 (left side), where a single dot represents a coupled mode pair. The size of the dote scales with the magnitude of its harmonic correlation H i j determined according to Eq. (18). The colour also indicates the magnitude and is used to enhance visibility. The frequency f of the identified mode pair is computed by the weighted sum of the DMD eigenvalues ν k (Eq. 14) sorted in descending order [35]: Here, an averaged over three eigenvalues (n = 3) is used, to mitigate possible corruption by noise. The relative energy content K , with respect to the total energy, of the mode pair is given by: The subscripts i and j refer to the indices of the coupled SPOD modes. Three modes are selected (indicated by numbers), which are related to the cylinder oscillation (1), the vortex shedding (2) and their interaction (3). In Sect. 3.3, these modes are discussed in detail for N f = 32. To the right of each SPOD spectrum (Fig. 13), the spatial mode shapes v,i (x) (upper row) and the power spectral density (PSD) function of the real part of the corresponding temporal coefficients (lower row) are displayed for the three selected modes. The spatial mode shapes are visualised by the v-component of the velocity. For a compact representation, the mode spectra and spatial mode shapes are given without axis labelling. The axis corresponds to analogous plots in Fig. 6. The POD (N f = 0) spectrum in Fig. 13a shows few modes with a high harmonic correlation at lower Strouhal numbers St = f D/U ∞ and a high energy content K . In the following, the effect of the filter size is primarily demonstrated for the selected modes. The spatial shapes of these modes are clearly symmetric (1) and (2) or anti-symmetric (3). The mode spectra of the corresponding temporal coefficients show broad bandwidth with several higher peak frequencies at different Strouhal numbers. Some of these frequencies are present in multiple spectra. For example, all three of the mode spectra show peaks at St = 0.128 and St = 0.209. This indicates that the dynamics are not fully separated. In contrast, the SPOD (N f = 32) spectrum yields similar modes with respect to the spatial mode shape, the peak frequency and energy content, but with a higher harmonic correlation, Fig. 13b. Only minor differences in the spatial shapes for the selected modes are observable, e. g. slight shifts in the distribution of the extrema. Note, that due to the previously applied symmetrisation, the effect of the filter operation on the spatial mode shapes is reduced. The mode spectra, however, are significantly changed. Instead of showing multiple high spectral peaks, Fig. 13a, each of the SPOD mode spectra is centred around a single frequency (Fig. 13b) with narrow bandwidth, so that the peaks, such as St = 0.128 and St = 0.209, are now associated with a specific SPOD mode. Briefly, when comparing SPOD to the POD modes, the spatial mode shapes and the energy content remain almost unchanged. However, the mode spectra indicate that with the SPOD a better separation of the dynamics can be achieved resulting in temporal coefficients with an increased harmonic correlation.
The filter size N f is an important parameter of the SPOD. Gradually increasing N f allows a continuous transition from POD to DFT. Accordingly, the spectral separation appears to be continuous, too. Selecting the filter size is a compromise between two competing effects. Increasing N f increases the spectral separation at the cost of reducing the energy content of each mode. Thus, the selection of the filter size is not critical in the sense of finding a single optimal parameter, but identifying the range, where the dynamics are separated and the modes represent the most energy. Selecting the filter size based on a characteristic time scales of the flow showed suitable results for different flow configurations [35].

C Numerical derivative
Assuming non-periodic boundary conditions, the coefficients for the sixth-order compact scheme at the boundary are defined as follows:

D Experimental data
Velocity and force data were obtained from a study conducted at the University of Calgary [27]. A cyberphysical apparatus [8,11,44], described in Riches et al. [26], was used to operate a real-time simulation of the mechanical system described in Eq. (55) mÿ + cẏ + ky = F y , where m, c, and k are the structural mass, damping, and stiffness, respectively. F y is a time-dependent forcing from the wake dynamics in the transverse direction of the incoming free stream flow [13]. The measured forces are used as an input to calculate the kinematics, so that the structural mass, stiffness and damping can be modelled in software. With a motorised traverse, the cylinder is then forced to follow these kinematics. The experiment was conducted to analyse the wake of a hollow acrylic cylinder (diameter D = 19.05 mm and length L ≈ 20D) with a non-dimensionalised mass ratio m * and a damping ratio ξ : at Re = 4000 and U * = 8.75. Data of the time-resolved two-component planar PIV were collected in a plane of the cylinder midspan. The images were captured with a high-speed Phantom Miro M340 digital camera (2560 × 1600). The flow was seeded with 11.7μm hollow-glass spheres, and the plane was illuminated by a laser sheet from a Photonics 20 mJ Nd:YLF high repetition rate pulsed laser with a wavelength of λ = 527 nm. Both, camera and laser were configured to operate in dual-pulse mode at an effective data image per acquisition rate of 24Hz. For the processing of the PIV images, the LaVision DaVis 8.3 software suite was used. The time-resolved velocity vector fields were calculated using a multi-pass cross-correlation algorithm with a final interrogation window size of 24 × 24 pixels resulting in a vector spacing of 0.725 mm. More information and details of experimental setup and technical components are available in Riches et al. [26] and [27]. Within three realisations of the experiment, a total number of 4767 vector fields (1589 per realisation) were obtained at a sample frequency of 24Hz comprising approximately 298 periods of the cylinder oscillation.