Suppression of wind turbine noise from seismological data using nonlinear thresholding and denoising autoencoder

Seismologists found a significant deterioration in station quality after installation of wind turbines (WTs), which led to conflicts between WT operators and seismic services. We compare different techniques to reduce the disturbing signals from WTs at seismological stations by selection of an affected station. WT noise and earthquake signals have overlapping frequency bands, and thus spectral filtering cannot be used. As a first method, we apply the continuous wavelet transform on our data to obtain a time-scale representation. From this representation, we estimate a noise threshold function either from noise before the theoretical P-arrival or using a noise signal from the past with similar ground velocity conditions at the surrounding WTs. As a second method, we use a denoising autoencoder (DAE) that learns mapping functions to distinguish between noise and signal. In our tests, the threshold function performs well when the event is visible in the raw or spectral filtered data, but it fails when WT noise dominates. The use of the threshold function and pre-noise can be applied immediately to real-time data and has low computational cost. Using a noise model from our prerecorded database at the seismological station does not improve the result and is more time consuming. In contrast, the DAE is able to remove WT noise even when the event is completely covered by noise. However, the DAE must be trained with typical noise samples and high signal-to-noise ratio events to distinguish between signal and interfering noise.


Introduction
Over the last years, the installation of wind turbines (WTs) increased in Germany due to a substantial development of renewable energy. Owing to negative effects on humans, WTs are often installed in areas with low population density. Because of low anthropogenic noise, these areas are also well suited for sites of seismological stations. As a consequence, Abstract Seismologists found a significant deterioration in station quality after installation of wind turbines (WTs), which led to conflicts between WT operators and seismic services. We compare different techniques to reduce the disturbing signals from WTs at seismological stations by selection of an affected station. WT noise and earthquake signals have overlapping frequency bands, and thus spectral filtering cannot be used. As a first method, we apply the continuous wavelet transform on our data to obtain a time-scale representation. From this representation, we estimate a noise threshold function either from noise before the theoretical P-arrival or using a noise Wolfgang Friederich equally contributed to this work. Vol:. (1234567890) WTs are often installed in the same areas as seismological stations. Several studies investigated the imprints of WT noise on seismological recordings (Styles et al. 2005;Saccorotti et al. 2011;Stammler and Ceranna 2016;Estrella et al. 2017;Neuffer and Kremers 2017;Zieger and Ritter 2018;Neuffer et al. 2019;Lerbs et al. 2020;Neuffer et al. 2021) and characterised the wavefield of the disturbing WT signals. All mentioned studies come to the conclusion that the data quality is deteriorating due to the expansion of WTs. Stammler and Ceranna (2016) analysed time spans before and after installation of WTs at the Gräfenberg array (GRF) in Germany and show a strong dependence of local wind speed and noise spectra after the installation of new WTs. They conclude that the WT signals reduce the sensitivity of the GRF stations in a frequency range between 1 and 7 Hz, which is crucial for the analysis of seismic events. Furthermore, they noticed that installations of WTs within a radius of 5 km around a seismological station lead to a worsening in station quality. Therefore, Stammler and Ceranna (2016) suggest to prohibit additional installations within a distance of 5 km. Neuffer and Kremers (2017), Estrella et al. (2017), Zieger and Ritter (2018), and Neuffer et al. (2019) further substantiate this suggestion by estimating power spectral density (PSD) amplitude decay curves in the frequency band between 1 and 10 Hz. The suggestions by Stammler and Ceranna (2016) and the amplitude decay curves of other studies led to definitions of protection radii and radii of shared interest of the WT operators and the seismological data services, e.g. in the states of Bavaria (Windenergie-Erlass -BayWEE 2016) and North Rhine-Westphalia (Windenergie-Erlass NRW 2018). Thus, some stations are protected within a radius of 5 to 15 km, where the installation and operation of wind turbines is prohibited. The protection zone around a seismological station always depends on the importance of the station (e.g. 15 km for stations, which are used for the Comprehensive Nuclear-Test-Ban Treaty (Windenergie-Erlass -BayWEE 2016)), and the local geological and topographical settings (Estrella et al. 2017). At other sites, the WT constructors have to inform the seismological data services when planning new WTs close (several kilometres) to seismological stations and they may have to pay for a new station, when the station quality is degraded by the new WT.
Due to the large protection radii, a lot of potential area is lost that could be used for wind energy. In order to achieve decarbonisation of the energy sector for both private and industrial consumption, it is important that as much available land as possible can be used for renewable energies and especially wind energy.
Since the WT noise and the earthquake signals share a common frequency band, spectral filtering, based on the Fourier transform, which is normally used to suppress the noise, cannot be used. Furthermore, spectral filtering alters the waveform and can affect subsequent analyses. Mousavi et al. (2016) and Langston and Mousavi (2019) modified time scale coefficients after transforming time series data using the continuous wavelet transform (CWT) to effectively suppress noise. Even when signal and noise share a common frequency band, their method can separate signal and noise, but the signal is always visible in the noisy raw data in the examples shown, which is not always true for real data. Sometimes, the earthquake signal is completely covered by noise and therefore not visible in either the time or the time-frequency domain. Zhu et al. (2019), Mandelli et al. (2019), and Tibi et al. (2021) have developed a new approach that uses a convolutional denoising autoencoder (DAE) to suppress white noise, a variety of coloured noise, and non-earthquake signals at seismological stations. During the training process, the DAE learns to distinguish between signal and noise by simultaneously learning a variety of sparse representations from the input time-frequency coefficients and functions that map these representations into two individual mask functions for signal and noise, respectively. Zhu et al. (2019) and Tibi et al. (2021) used the short-time Fourier transform (STFT) to calculate the time-frequency coefficients. Mandelli et al. (2019) applied their DAE on 2D common shot gathers.
In our study, we attenuate noise from WTs at a seismological station by using two different denoising techniques. As a first method, we use nonlinear thresholding (Langston and Mousavi 2019), where time-frequency coefficients exceeding a certain threshold are attenuated. As a second method, we use a modified DAE Mandelli et al. 2019;Tibi et al. 2021), which takes a large training dataset to learn how to separate signal and noise. The DAE uses either the STFT as done by Zhu et al. (2019) and Tibi et al. (2021) or the CWT to transform the noisy raw data into the time-frequency or timescale domain. STFT and CWT have different resolutions (e.g. Mallat 2009), so CWT may be able to suppress noise more accurately, which was not tested in previous studies. Nonlinear thresholding can be applied on real data without any training, whereas the DAE has first to be trained. We compare raw data, bandpass filtered data, the results of the nonlinear thresholding, and the DAE, using different techniques for the estimation of the threshold and using STFT and CWT to determine the time-frequency coefficients for the DAE.
The paper is structured as follows: First, we present the selection of a suitable seismological station that is deteriorated by the installation of new WTs. Second, we introduce the different denoising techniques, where we focus first on nonlinear thresholding, the CWT, and how we tested different noise models to suppress the noise. Afterwards we introduce the DAE, show the differences to the existing DAEs by Zhu et al. (2019) and Tibi et al. (2021), and present the dataset that we used to train the DAE. In the last part, we show the results of the denoising techniques on real seismological data.

Site selection
To find a seismological station, which is negatively affected by surrounding WTs, we compute hourly power spectral density (PSD) curves for several stations in the state of North Rhine-Westphalia (NRW, Fig. 2a) and bin those curves by the local wind speed by the method as proposed by Stammler and Ceranna (2016). The results are average PSD curves for each wind speed bin. Figure   When the spectral lines in Figure 1 are close together, there is no evidence for any correlation of wind speed and noise level (Stammler and Ceranna 2016). In 2012 (Fig. 1a), there exists already some wind dependency, which might result from a WT installed in 2009 or from trees in the surroundings (Fig. 2b), which also emit short-term seismic noise (Bormann and Wielandt 2013). In contrast in 2019 ( Fig. 1b), there are distinct noise peaks at 3.2 Hz and 4.2 Hz which grow with wind speed. The comparison of the yellow PSD curves (Fig. 1) shows a deterioration of 20 dB between 2012 and 2019, which is a tenfold increase compared to the original noise amplitude. During this time period, five new WTs of different types were installed around the seismological station (Table 1). PSD curves at other stations where fewer or no WT were installed do not show similar wind dependencies. For this reason, we selected the seismological station RN.BAVN for our study. The geology at this site can be described as unconsolidated sediments and consists of Upper Cretaceous sands.
After selection of the seismological station, we installed short period seismometers (Raspberry Shakes RS3D and Mark L-4C3D) close to each WT (50 to 80 m away from the foundation). The new seismometers record at a sampling frequency of 100 Hz. Figure 2b shows the locations of all seismometers (red triangles) at this site as well as the locations of the WTs (blue stars). Furthermore, to connect the operation parameters (wind speed at the hub height and rotation speed) and the recorded seismological data, the WT owner kindly provided the operation parameters of each WT, except one. At this WT, we did not install a new station (Fig. 2b). The new stations are used to find similar signals in WT noise to get an estimate of the disturbing signal at station RN.BAVN (see Section 3.1).

The CWT and nonlinear thresholding
Since the short-time Fourier transform (STFT) is limited in time-frequency resolution, we use the CWT to transform signals into time-scale representation. The CWT takes two independent variables, scale a and time lag , and is defined as (Mallat 2009) where Ψ(t) is the mother wavelet and the asterisk represents the complex conjugate. Thus, the CWT is a correlation of the time series x(t) with a scaled and dilated wavelet Ψ(t) . A wavelet is admissible when it has zero average, i.e. and when the integral exists. Here, Ψ ( ) denotes the Fourier transform of the mother wavelet. Transforming from time-scaledomain back into time-domain is done by the double integral (Mallat 2009) In this study, we use the implementation of the CWT and its inverse for discrete sequences by Torrence and Compo (1998). An example of an admissible mother wavelet is the Morlet wavelet, which is a Gaussianmodulated sinusoidal pulse (Farge 1992;Torrence and Compo 1998): where 0 is a nondimensional frequency and is set to 6 (Farge 1992; Torrence and Compo 1998) and i denotes the imaginary unit. The real and imaginary parts of the wavelet are shown in Fig. 3. To compare the results of STFT and CWT, pseudo-frequencies f a can be determined by the relationship f a = f 0 a (e.g. Stockhausen 2016), in which f 0 is the centre frequency of the scaled wavelet and a is the scale.
To attenuate disturbing noise, Donoho and Johnstone (1994) and Donoho (1995) introduced the concept of nonlinear thresholding in the wavelet-domain using an universal threshold. They assume that the recorded signal x(t) is a superposition of the seismic signal s(t) and some additive noise n(t) and present a denoising method in which wavelet coefficients below some scale-dependent threshold (a) are considered as noise and suppressed. Since the noise is usually unknown, Donoho (1995) presents a method of keeping wavelet coefficients if they are greater than a threshold function (a) . Otherwise, they are classified as pure noise and are set to zero. This is called hard thresholding. Hard thresholding results in lower signal-to-noise ratio (SNR) if noise outliers are above the threshold function. Alternatively, thresholding can also be done by shrinking all coefficients that exceed the threshold function. This concept is called soft thresholding and is given by (Weaver et al. 1991;Langston and Mousavi 2019) where The threshold function (a) is determined from a noise estimate for each wavelet scale a and is estimated as (Langston and Mousavi 2019) where Q P denotes the quantile function and P is the quantile value. Following Langston and Mousavi (2019), it is not necessary for the noise estimate to be characterised by a particular distribution function such as a Gaussian distribution. The noise estimate is either determined from a time window before (pre-noise) or after the expected event or from a different time window in the past or future with similar noise conditions. To find such time windows, we use records from the sensors close to the WTs. We hypothesise that, if the noise recorded close to a WT in the event window and a different time window are very similar, this also holds for the noise at the seismological station because the noise propagates through the ground from the WT to the station. Once such a highly similar time window is found in the records close to the WT, we can extract the waveform at the seismological station at the given time and use it as a noise model to determine the threshold function ( ).
To check this hypothesis, we took records form the sensors close to the WTs and calculated the Euclidean distance between the noise in the event window and other windows of equal length before the event. The normalised Euclidean distance of a master signal t i and a subsequence t j is defined as where i and j are the mean of each subsequence and n is the length of the master signal t i . Figure 4a shows an example of a synthetic time series where the similar signals with different amplitudes (black, red, blue, and pink signals in Fig. 4a) are hidden. The black signal is used as the master signal and the distance profile is computed (eq. 10) for this signal (Fig. 4b). The resulting distance profile is zero (global minimum) at the timestamp of the master signal and has local minima at the timestamps of similar signals. These are marked as stars in Fig. 4b.
Example of a synthetic time series (a), a master signal (black), and the distance profile of the master signal and the time series (b). The red, blue, and pink signals have a different amplitude scaling but the same shape as the master signal and are hidden in the time series in (a). Local minima in the distance profile (black, red, blue, and pink stars in (b)) indicate the location of these similar signals in the time series. When extracting a similar signal using the distance profile, the location of the local minimum is taken as the starting point

Distance Pro le
Compared to cross-correlation, which is commonly used in seismic data processing to extract similar signals, Euclidean distance is computationally more efficient for large datasets Zhu et al. 2016; Akbarinia and Cloez 2019). An average distance profile is formed from all distance profiles calculated at each WT with the event window as master signal and the average distance profile is searched for minima. The global minimum is where the master signal is located in the time series. Therefore, the global minimum is truncated and the minimum with the second smallest Euclidean distance is taken as the starting point for the most similar signal. The final step is to extract the waveform at the seismological station at the time of the minimum in the average distance profile, as this waveform follows from similar conditions at all WTs. Figure 5 summarises the workflow. Figure 6 shows an example of such a similarity search using three WT stations. The red waveforms are the master signals for each WT and the grey waveforms are the most similar signal at the WT in a small range around the minimum of the average distance profile (Fig. 6a-c). Figure 6d shows the extracted waveform which is used as a noise model to estimate the threshold function (a) (eq. 9, red line) and the noisy waveform with an expected event in the middle (grey waveform). After applying the threshold function, the modified CWT coefficients Ŵ (a, ) are transformed back into time domain to obtain the denoised waveform. Figure 7 summarises the dataflow of nonlinear thresholding using either pre-noise or the extracted waveform as a noise model.

Denoising autoencoder
Zhu et al. (2019) introduced a novel time-frequency denoising technique, where a convolutional neural network (CNN) learns sparse representations of spectrograms and maps this representation into two mask functions for signal M S (t, f ) and noise M N (t, f ) , respectively. In general, this method is called a denoising autoencoder (DAE). We start from the time-frequency representation of equation 6: The objective is to separate signal and noise by the computation of a denoised signal and the recovered noise where TFT −1 denotes the inverse time-frequency transform. The DAE learns two mask functions and for signal and noise, respectively. During the training process, signal and noise samples from a database are added to construct a noisy signal. Since the DAE learns from real data, it is not required that the noise samples are characterised by a specific distribution function. The true mask functions ( M S (t, f ) , M N (t, f ) ) interact as labels of the CNN and the DAE has the objective to minimise the expected error E between  Extract the waveform at the seismological station at the time of the minimum in the distance pro le the true mask functions and the predicted mask func- where i is either S or N and ‖⋅‖ 2 denotes the Euclidean norm. The mask functions for signal and noise consist of real values ranging from 0 to 1. The network architecture is similar to those CNNs presented by Zhu et al. (2019) and Tibi et al. (2021) and is shown in Fig. 8. However, the input size and the depths of the CNN differ by the chosen time-frequency representation and we add dropout to each convolutional layer to prevent the neural network from overfitting (Srivastava et al. 2014). We also use skip connections to improve the convergence of the network (Ronneberger et al. 2015;Liu and Yang 2018). The input layer consists of the real and imaginary parts of the previously estimated time-frequency representation, which are min-max-normalised independently of each other, i.e. the values of the input layers are transformed to the range [0, 1]. In the first half of the network (encoder), the input data are converted into a multitude of sparse representations by passing them through a series of 2D convolutional layers with a constant filter size of 3 × 3 , followed by a rectified linear unit (ReLU) as activation function, batch normalisation (BN) to improve the optimisation of the network (Ioffe and Szegedy 2015; Goodfellow et al. 2016), and a droput layer

Fig. 6
Example of similarity search of a master signal (red waveforms) at three WTs (a-c) and the resulting extracted waveform (red line in (a-c)) that has the smallest Euclidean distance (distance in a-c) to the master signal and is thus the most similar signal. After finding the most similar signal at each WT in a given time window, the waveforms are extracted at the seismological station (d). The grey waveform is the noisy signal and the red waveform is used to determine the threshold function ( ) for the denoising process. The waveforms at the WTs used to determine the distance profile have a length of 6 s (a-c), while the noise model (d) has a length of 60 s to obtain a more representative noise model (blue arrows in Fig. 8). To reduce the size, the output of the layer above is passed through a 2D convolution with constant filter size ( 3 × 3 ) and strides, followed again by a ReLU, BN, and a dropout layer (orange arrows in Fig. 8). The stride size depends on the size of the input layer. We used either 2 × 2 or 2 × 3 as stride size for STFT and CWT, respectively. In the second part of the network (decoder), the sparse representations are upsampled using 2D deconvolution (transpose of the convolutional layers) to generate a mapping function for the two output masks. Since the dimension does not fit in every case with transposed convolution, a cropping layer is added to get the correct size (green arrows in Fig. 8). The last hidden layer is convolved with a 1 × 1 filter and passed through a softmax activation function (Goodfellow et al. 2016) to generate the mask functions for signal and noise, respectively. During the training process, the pre-estimated mask functions act as labels and the network learns how Since the up-sampled layers do not have to be the same size as the down-sampled layers, these layers are cropped during upsampling. Dots indicate a repetition of the previous operations. Batch normalisation and skip connections are used to improve the convergence of the neural network. A softmax layer is used as the last layer to generate the mask functions for signal and noise to map from the real and imaginary part of the time-frequency coefficients to the sparse representations, and from there to the output layer by optimising the binary-crossentropy that is used as the loss-function. Figure 9 summarises the data flow of the DAE. First, the noisy time series (Fig. 9a) is transformed into time-frequency domain either by CWT or STFT (Fig. 9b). The time-frequency coefficients are then fed into the previously trained neural network (Fig. 9c), which has learned to distinguish between signal and noise during the training process. The network outputs two individual mask functions for signal and noise which are element-wise multiplied with the noisy time-frequency coefficients to get two time-frequency representations for signal ( Fig. 9d(I)) and noise (Fig. 9d(II)). Finally, the time-frequency coefficients for signal and noise are transformed back into time domain to obtain the time series for signal ( Fig. 9e(I)) and noise (Fig. 9e(II)) which can then be used for further analysis.

Data and network training
To train the CNN, we use the Stanford Earthquake Dataset (STEAD) which contains 1,050,000 local earthquakes . We select signals with high SNR (SNR ≥ 20 dB) and use 100,000 of these seismograms as a training dataset. Each waveform is 60 s long and is recorded with a sampling rate of 100 Hz. More than 100,000 waveforms with a duration of 60 s are randomly taken as noise samples from the affected station. To ensure that the time window contains only pure noise, the data are checked by an STA/LTA (Withers et al. 1998) for any spikes and the periods in which earthquakes have been detected at surrounding stations are excluded. For a more accurate denoising, we further selected noise samples close to the expected onset of the P-phase, which is estimated from onsets at other stations in the seismological network. During the training process, at the beginning of each training iteration (epoch), the earthquake dataset is randomly split into one training dataset comprising 80 % of the full dataset and one validation dataset (20 %). At the start of each epoch, first the signal is randomly shifted (data augmentation), then a noise sample is randomly selected from the noise dataset, and both signal and noise are scaled independently by a positive random number. As a last step, we add noise and signal to construct a noisy seismogram. By applying these steps, we build a dataset with millions of different signal and noise samples for three component seismograms. We trained the DAE on an Nvidia Quadro RTX 6000 GPU with a batch size of 16 for CWT and 32 for STFT, for a maximum number of 200 epochs, and we used the Adam optimiser (Kingma and Ba 2014) with a learning rate of 0.001. The input layer has size 121 × 100 × 2 when STFT is used and 3001 × 200 × 2 when CWT is used. With CWT, the data are downsampled by a factor of 2; otherwise, the input layer would be too large for the memory. The loss of the validation data is monitored during the training and training stops when the validation loss does not improve further. The CNN training takes a few hours on a single GPU if STFT is used as the time-frequency representation, and up to several days for CWT. The network is built using the Python package Keras (Chollet 2015) on top of the machine learning platform Tensorflow (Abadi et al. 2015) and contains ∼ 1.2 million trainable parameters and ∼ 2500 nontrainable parameters (parameters that are not updated during the training process, e.g. mean and standard deviation in BN layers), when STFT is used as time-frequency representation and ∼ 19 million trainable and ∼ 10, 000 nontrainable parameters, when CWT is used. The difference results from the different resolutions of STFT and CWT, respectively.

Performance of the DAE
To evaluate the performance of the trained network, we create a test dataset from events and noise signals, which were not used for training. Our test dataset contains 5000 noisy waveforms. Since the waveforms of STEAD ) contain phase onsets, we are able to determine the SNR in decibels (dB) for the raw, noisy, and denoised signal by where A S is the root mean square for a 3-s-wide window around the P-arrival and A N denotes the root mean square for a 3s-wide window before it. The results are shown in Fig. 10a for CWT and in Fig. 10b for STFT as time-frequency transformation, respectively. Both networks are not able to reach the original SNR (grey bars) but in general, both methods result in a higher SNR (blue bars) in comparison to the noisy signals (green bars).
The mean SNR of the denoised data increases by 26.6 dB for both CWT and STFT (Fig. 10). Comparing the denoised histograms, CWT achieves a slightly better result than STFT, as the first standard deviation of CWT is slightly larger than that of the STFT. Furthermore, the denoised test dataset using the CWT has more events in the range 35 to 45 dB ( Fig. 10a and b).
We apply both networks on real data at station RN.BAVN. For this, we use all events with a magnitude greater than 1 between 2017 and 2020 in the state NRW, Germany (2421 events), and use the event list of the seismological observatory at the Ruhr-University Bochum. Since many events are fully covered by noise at the seismological station, we estimate the arrival times at the station from analysed events at the surrounding stations. Thus, we are able to determine the SNR before and after denoising. The mean SNR is increased by 4.4 dB (Fig. 10c) and 4.2 dB (Fig. 10d) for CWT and STFT, respectively. Again, the first standard deviation for the CWT is slightly larger than for STFT. The histograms show clearly that many events with low SNR retained a low SNR after denoising. The CWT also performs slightly better on real data ( Fig. 10c and  d), especially when comparing denoised events with an SNR ≥ 40 dB.
The improvements in SNR for the test and the real dataset are shown in Fig. 11. Both trained DAE models are able to improve the SNR of the noisy test dataset (Fig. 11a), but applied on the real dataset, the improvement is less (Fig. 11b). The average improvement for the test dataset is 20.6 dB and 4.6 dB or 4.4 dB for the real dataset when either CWT or STFT is used. Few events are improved by more than 30 dB (Fig. 11b). From both datasets, it follows that the DAE using the CWT performs slightly better, especially on real data, as the first standard deviation is slightly larger with the CWT (Fig. 11b).

Fig. 10
Histograms showing the performance of the DAEs for CWT and STFT . Grey bars represent SNR distribution of data from STEAD , green bars are noisy data, and blue bars are the SNR distribution after denoising. Note that the semitransparent bars overlap each other, which changes the colour. In (a) and (b), the DAEs are tested on a dataset that contains 5000 noisy waveforms (green bars). This dataset was created by adding real seismic noise from the station RN.BAVN to earthquakes from STEAD (SNR ≥ 20 dB) that were not used during the training process (grey bars). After applying the denoising method, the SNR was determined to check whether the SNR improves compared to the noisy dataset and whether the DAE is able to recover the original dataset. In (c) and (d), the DAEs are applied to a real dataset at station RN.BAVN between 2017 and 2020. The dataset used contains 2421 waveforms. The average SNR of each dataset is given by the values in the top right corner of each subplot, with the associated first standard deviation in parentheses either a noise window before the first arrival or a noise model estimated from the similar signals at the WTs (eq. 10). Figure 12 shows a waveform where an event is detectable in the raw data (Figs. 12a(I) and 16a(I)). The time-frequency representation ( Fig. 12b(I)) shows some high energy around 4 Hz. Applying the bandpass filter leads to an increase in SNR by 6.9 dB (Figs. 12a(II) and 16a(II)) and it also removes all signals, which are not in the passband. Using the threshold function estimated in a 30-s-wide time window before the first arrival yields an increase in SNR by 22.9 dB (Figs. 12a(III) and 16a(III)). Using a noise model only results in an increase of 11.8 dB (Fig. 12a(IV)), but the noise level is lower than in the raw or filtered data. However, the timefrequency representation still contains some high energy spots around 4 Hz, which are not considered as noise by the threshold function. Both trained DAEs (Figs. 12(V), (VI), and 16a (V), (VI)) are able to fully remove the noise before the first arrival, which results in an increase in SNR of around 30 dB. Also, the time-frequency representations do not contain any high energy spots before or after the event.
A second example shows an event that is also visible in the raw data (Figs. 13a(I) and 16b(I)), but whose SNR is not improved by the bandpass filter (Figs. 13(II) and 16b(II)). Again, using pre-noise for the threshold function suppresses the noise well (Figs. 13(III) and 16b(III)) and increases the SNR by 14 dB. In comparison, the pre-estimated noise model still has a moderate noise level, which results in an increase in SNR by only 6.4 dB compared to the raw data. The DAE that uses the CWT as time-frequency representation has the highest increase in SNR of 32.9 dB (Fig. 13(V)). Both DAEs suppress the noise well, but for the DAE using the STFT (Figs. 13(VI) and 16b(VI)) some noise remains before the first arrival, which explains the difference in SNR.
In the third example (Figs. 14(I) and 17a(I)), the SNR is low in the raw data and in the time-frequency representation some high energy peaks can be recognised after 30 and 50 s. Applying the bandpass filter on the raw data does not lead to an improvement (Figs. 14(II) and 17a(II)). Both thresholding functions make the event easier to detect (Fig. 14(III) and (IV)), but the resulting waveforms have spikes, which hinders further analyses. The DAEs increase the SNR by more than 20 dB and suppress the noise well (Figs. 14(V), (VI), and 17a(V) and (VI), b(V), and (VI)). Comparing the amplitudes of both waveforms demonstrates that the DAE using CWT has more loss in amplitude than the DAE using STFT.
The final example in Figs. 15 and 17c shows an event that is fully covered by noise. Noise and signal of the raw data (Figs. 15a(I) and 17c(I)) do not differ in amplitude, which results in an SNR close to zero. Also, the bandpass filter neither improves the SNR nor makes the event visible (Figs. 15a(II) and 17c(II)). The time-frequency representations of the raw data and the filtered data do not give a hint SNR CWT = 20.6 (± 8.0) dB SNR STFT = 20.6 (± 8.6) dB

Fig. 11
Distribution of the improvement in SNR for either DAE with CWT (blue bars) or DAE with STFT (green bars).
(a) Improvement of the noisy test dataset. (b) Improvement of the real dataset at station RN.BAVN. Values indicate the average SNR, with the first standard deviation in parentheses as to when the event might be. Nonlinear thresholding also fails ( Fig. 15(III) and (IV)), no matter which method is used. Only the DAEs are able to distinguish between noise and signal and make the event visible and detectable (Figs. 15(V), (VI) and 17c (V), (VI)). Figures 12 to 17 show the vertical component of the recorded data only. Usually, seismometers record in three directions: one vertical component and two horizontal components in east-west and north-south direction, respectively. However, the horizontal components can be processed in the same manner since the DAE learns to denoise all three components during training. Figure 18 shows an example of a three component seismogram. The left column (Fig. 18a) presents the raw data and the middle and right column the denoised data using the DAE using STFT (Fig. 18b) and the DAE using CWT (Fig. 18c), respectively. The upper row shows the vertical (Fig. 18(I)), the middle row the north-south (Fig. 18(II)), and the lower row the east-west component ( Fig. 18(III)). The denoising results do not differ significantly when the horizontal components are  included, and we are able to determine the arrival of the S-phase, which can be recognised by the high amplitudes in the horizontal components.

Discussion and conclusions
The objective of this work is the suppression of interfering WT noise in seismological recordings using different noise suppression techniques. For this purpose, we calculated wind-dependent PSD curves at several stations in NRW in order to find a suitable seismological station that is degraded by the installation of new WTs. We found a station close to Haltern am See, where six WTs are in the vicinity of one seismological station (Fig. 2). As a first method, we used nonlinear thresholding to suppress the disturbing noise by defining a threshold function (Fig. 7) either from a noise window before an event or from a previously calculated noise model. To obtain a noise model, we first installed additional seismometers at each WT to record the noise directly at the source, extracted similar waveforms at each WT using Euclidean distance (Figs. 4,5,and 6), and used the data recorded at the seismological station RN.BAVN at the time when similar signals were found at the WTs as our noise model (Fig. 6). As a second method, we used a DAE which learns from a training dataset how to distinguish between signal and noise. As a training dataset, we used local earthquakes from STEAD ) and took noise samples from the affected seismological station. Furthermore, we tested two different time-frequency transformations (STFT and CWT) for the DAE. Zhu et al. (2019) and Tibi et al. (2021) only used the STFT which has a lower time-frequency resolution than CWT. We applied a bandpass filter between 5 and 20 Hz to compare the different denoising techniques. The results on real seismological data  show that nonlinear thresholding using pre-noise well suppresses the noise, when the event is detectable in the raw or bandpass filtered data. This is consistent with the findings of Mousavi et al. (2016) and Langston and Mousavi (2019). Using pre-noise Fig. 14 The same as in Fig. 12 but for different waveforms with a lower SNR in the raw data (a(I)). Magnified windows of the P-and S-phase onsets are shown in Fig. 17a  instead of calculating a noise model from the distance profile is more efficient in suppressing noise. Furthermore, the estimation of a noise model is more complex, since additional seismometers at the WTs are required and it is necessary to determine a noise model before denoising an event, which is time consuming. Using pre-noise is therefore superior (to the noise-model determination) as it can almost be done in real time. When denoising events where signal and noise have similar amplitudes (Fig. 14) or the event is fully covered by noise and can neither be recognised in the waveform nor in the time-frequency representation, nonlinear thresholding always fails for our examples. We have found that the resulting waveforms contain many spikes, making them unusable for further seismological analysis. In our shown examples, the DAE is always able to remove the noise from the raw data and make the event visible in both the time and time-frequency domain. To test our DAE models, we constructed artificial noisy signals by randomly adding signal and noise from our dataset, which was not used in the training process ( Fig. 10a and b). Using STFT or CWT as a time-frequency transform does not result in a big difference in the denoised data. Also both trained DAE models show similar results, when they are applied on the real dataset recorded between 2017 and 2020, which contains 2421 events ( Fig. 10c and d). However, for both the test dataset (Fig. 11a) and the real dataset (Fig. 11b), some data are not improved or in some cases noise remains before the P onset (Figs. 16b(VI), 17a (V) and (VI), and 17c (V) and (VI)). This might result from a too high noise level in the constructed noisy dataset or in the real dataset, so that the DAE is not able to remove the noise, i.e. the DAE classifies the selected time window as noise and is not able to find any signal. Here, a too high noise level cannot be specified because the results (Fig. 10-18) do not show systematic limits of the DAE. Finding these limits needs further investigations.
In conclusion, there is no big difference when using CWT or STFT as time-frequency transformation. Furthermore, since CWT results in a larger size for the input layer, the training of the DAE is more time consuming than for STFT. However, both presented DAEs are able to suppress WT noise and they are also applicable on three component data (Fig. 18). The DAE is the only one of the methods investigated that is capable of separating noise and signal well, especially when signal and noise share a common frequency band.

Fig. 17
Magnified windows around the P-and S-phase onsets (red and blue lines) in Figs. 14 (a, b) and 15 (c). The P-phase from Figure 14 is shown in (a) and the S-phase in (b), respectively for signal and noise. Furthermore, the mask function only has values between [0, 1], and thus, the DAE is not able to recover signal values which are larger than the noisy signal ). However, both DAEs do not lead to time shifts or signal deformations and the polarity is not changed (Figs. 16 and 17). How the DAE affects earthquake localisation, magnitude estimation, and other seismological analyses needs to be further investigated, especially when earthquake data are only visible on stations after denoising. The use of a DAE could therefore lead to more earthquake detections. As a future improvement, one could assume a real and an imaginary output layer for the signal instead of two mask functions for signal and noise, respectively. The DAE is then trained by minimising the error between the true and the predicted signal and may be able to learn more from the phases of the spectrogram.
Once a DAE model has been trained for a station, it can be used as long as the noise at the station does not change. When the noise at the station changes, new training might be necessary. We also tested the DAE on real time denoising. To do this, we split the incoming data from the seismological station into overlapping 60-s windows, denoise each window, and merge the individual windows back into one trace using ObsPy (Beyreuther et al. 2010). Afterwards, the denoised trace can be analysed by common seismological software packages. In general, the DAE can be trained with any type of noise depending on the application. At the moment, the training dataset used  consists only of local earthquake data. Adding regional and teleseismic data to STEAD could be a future improvement. owners of the wind turbines near Haltern am See (North Rhine-Westphalia, Germany) for their cooperation, sharing of the wind turbines' operational data, and permission to install new seismometers. Many thanks to the seismology working group at the Ruhr-University Bochum and the anonymous reviewers for proofreading this work and their helpful recommendations.
Funding Open Access funding enabled and organized by Projekt DEAL. The MISS research project was funded by the "Europäische Fonds für regionale Entwicklung (EFRE)".

Conflict of interest
The authors declare no competing interests. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.