1 Introduction

Studying variations in ground motion amplitudes as a function of distance and frequency is essential for accurately assessing seismic hazards and risk levels densely populated urban areas. In this regard, kappa (κ) is a well–known and practical parameter that characterizes the exponential decay of the high–frequency spectral attenuation of shear waves. It has been used for various purposes, including stochastic simulations, regional adjustments of ground motion models (Atik et al. 2014; Bora et al. 2017), and correlations with site proxies, such as the average shear wave velocity of the top 30 m subsoil (Vs30) (van Houtte et al. 2014; Ktenidou et al. 2015; Cabas et al. 2017).

κ values exhibit regional characteristics, necessitating local κ estimates from past earthquake data for region–specific applications. κ studies regarding sites located in Türkiye include predictive models and estimations for the following regions. In Northwestern Türkiye, Askan et al. (2014) proposed a regional κ model for soil sites using a multivariate linear regression approach. Kurtulmus and Akyol (2015) analyzed ground motion data acquired from local small and moderate earthquakes to estimate regional κ values for central west Türkiye. Tanırcan and Dikmen (2018) predicted κ values for downhole arrays and bedrock sites in western Istanbul using small earthquake records. In their latest study, Biro et al. (2022) performed κ estimations for the eastern Anatolian region in Türkiye using data from a single fault segment to mitigate the source effects in κ values. Altindal and Askan (2022) proposed regional κ models for Eastern, Western, and Northwestern Türkiye using two different modeling approaches within a probabilistic framework. Sertcelik et al. (2022) determined κ values for 161 strong–ground motion stations operated by Türkiye’s Disaster and Emergency Management Authority Presidential of Earthquake Department in the Marmara Region.

Since the pioneering work of Anderson and Hough (1984), understanding what causes the amplitudes of Fourier spectra of strong–ground motion to begin to decay with increasing frequency in the high‐frequency range (f > fe) remains a challenge in engineering seismology. Previous studies have alternating views on the source and site dependency of κ (Petukhin and Irikura 2000; Tsai and Chen 2000; Ktenidou et al. 2013). Significant research has been published focusing on deriving predictive models for κ as a function of earthquake magnitude, the type of source mechanism, source–to–site distance, and site class (Anderson and Hough 1984; Purvance and Anderson 2003; Motazedian and Atkinson 2005; Drouet et al. 2010; Iwakiri and Hoshiba 2012). In most of these applications, κ is assumed to be a linear function of distance and site effects, where a zero–distance kappa value (κ0) is defined to eliminate path dependencies. Ktenidou et al. (2014) discussed that κ is mostly site dependent; however, to represent the physical phenomena comprehensively, parameters regarding the source and epicentral distance effects could also be included in the predictive models.

The relationship between the independent variables and κ may not be sufficiently described using traditional linear regression models because of considerable scattering in the individual κr values. Along with the potential nonlinearities of site response influencing κ at the recording site, a nonlinear variation of the observed κ with distance is possible. Sotiriadis et al. (2021) examined nonlinear regression analyses to relate κ, as computed for free–field and foundation motions, to magnitude, source–to–site distance, and VS30. Recently, Gičev and Trifunac (2022) stated that the nonlinear behavior of soil and rock materials should be considered when predicting κ from greater than M > 5 events, and they proposed a physical model that κ can be associated with nonlinear site response. Moreover, various researchers have determined the average κ along the S–wave source–station paths using nonparametric inversion approaches (e.g. Anderson 1991; Fernández et al. 2010; Castro et al. 2022).

An effective approach to estimate the complex nature of the κ parameter, influenced by multiple interrelated parameters, involves the use of robust data–driven methodologies such as Multivariate Adaptive Regression Splines (MARS). MARS is a nonparametric, multistage regression method that captures nonlinear interactions between variables in several dimensions (Friedman 1991). The primary advantage of MARS is its ability to accurately estimate the contributions of the input factors to the response variable. The algorithm automatically identifies the most salient qualities within the dataset and incrementally incorporates terms into the model. Each term represents a primary or interaction effect of a piecewise linear function that is included in the model if it improves the model’s goodness of fit. The approach iteratively incorporates additional terms until the model reaches a point where further improvement is no longer observed. MARS is a well–known statistical methodology that has been used in numerous susceptibility evaluations of natural disasters such as floods, landslides, and wildfires (e.g. Felicísimo et al. 2013; Conoscenti et al. 2015; Hai et al. 2023). Its effective application in several domains of geotechnical engineering, such as determining the elastic modulus of rocks (Samui 2013), analyzing soil liquefaction (Zhang et al. 2015; Zheng et al. 2020), assessing soil and wall characteristics in braced excavation (Zhang et al. 2017), and evaluating slope reliability (Deng et al. 2021), has been well documented. To gain a thorough understanding of the MARS technique and its practical applicability to geotechnical problems, please refer to Zhang (2020). The recent trend involves integrating AI–based techniques and other data–driven methods with the MARS methodology to achieve higher accuracy, improved interpretability, and valuable insights in diverse scientific fields (e.g. Zhang et al. 2021, 2022; Asare et al. 2023; Vaheddoost et al. 2023; Hong et al. 2024). Therefore, it may be possible to mitigate the inherent limitations of a singular technique, such as insufficient memory, overfitting, and uncertainty, by using a convenient hybrid approach.

The recent seismic activity on the west coast of Türkiye, including the Aegean Sea region, indicates that a closer focus is necessary on this region. Located in an active tectonic regime of North–South extension with multiple basin structures on soft soil deposits, the Aegean Sea region in Türkiye has a high seismic hazard. Recently, as a combination of basin effects and building vulnerability, on October 30, 2020, the Samos event (Mw = 7.0) caused localized major damage and collapses in the Izmir city center despite the 70 km distance from the earthquake source. Despite this activity, studies on site characterization and site response modeling, including local velocity models and κ estimates, are still limited in this region (Akyol et al. 2013; Kurtulmus and Akyol 2015; Pamuk et al. 2018, 2019).

This study attempts to develop predictive κ models for western Türkiye with a particular focus on the Aegean Sea region. We used three–component strong–ground motion data from accelerometer stations with known VS30 values throughout western Türkiye, including Aydın, Denizli, Izmir, Kutahya, Manisa, and Mugla in the Aegean region, as well as Balikesir and Canakkale provinces. Prediction models were built using multiple linear regression (MLR) and MARS techniques. According to the literature, MARS has not been previously used in the modeling of high–frequency spectral attenuation of shear waves. The impacts of epicentral distance Repi, magnitude Mw, and VS30 (site class) are explored, and the contributions of each parameter are analyzed using a large dataset that includes recent seismic activity. In κ assessment, models are validated using well–known accuracy metrics such as correlation coefficient (r), R–square, adjusted R–squared (Adj. R2), mean absolute percentage error (MAPE), and mean squared error (MSE).

2 Study area and tectonic framework

The Aegean Sea and the basin–and–range system in the western part of Türkiye formed as a result of a long series of tectonic regimes, including the collision between the Arabian and Eurasian plates and the northward movement of the African lithosphere beneath Eurasia (Jolivet et al. 2013; Şengör and Yazıcı 2020). The collision– and subduction–related mechanisms have led to the Anatolian block’s westward escape along the right–lateral North Anatolian and the left–lateral East Anatolian Fault Zones (NAFZ and EAFZ, respectively) and the subsequent African slab roll–back and trench retreat along the Aegean–Cyprian Subduction Zone (Şengör et al. 1985; Dewey et al. 1986; Bozkurt 2001; Le Pichon and Kreemer 2010; Barbot and Weiss 2021). As a result, the continental crust has expanded and volcanism has occurred in the overlying Aegean–west Anatolian region (Fig. 1) (Jolivet et al. 2013). Although the origin and evolutionary history of the principal extensional features remain a matter of debate, western Türkiye is characterized by a complex array of grabens and basins bounded by large–scale extensional detachments and high–angle normal faults. (Brun and Sokoutis 2007; Çiftçi et al. 2010). The horst–graben architecture arises in different geographical trends, including EW–trending grabens (e.g. from the Gulf of Edremit to the north and toward the Gulf of Gökova in the south), NE–trending depressions (e.g. Gördes basin, Acıpayam and Burdur grabens), and intervening horsts, respectively (Bozkurt and Mittwede 2005; Çiftçi et al. 2010). All basins contain the Menderes core complex, Lycian nappes, and Tauride platform basements and are filled with lower–middle Miocene to Holocene continental deposits of more than 3000 m in some locations (Çiftçi et al. 2010; Özkaptan et al. 2021).

Fig. 1
figure 1

Location map showing the main tectonic elements of the Aegean and Anatolian regions (adapted from Barbot and Weiss (2021)). The black arrows indicate the direction of plate movement relative to Eurasia. The east–to–west increase in the size of the arrows reflects the documented 20 mm/yr increase in surface velocity. The dashed black box denotes the location of the study area. The research region was divided into smaller sections as A1, A2, and A3 in accordance with the major seismotectonic domains identified by Duman et al. (2018). The major tectonic features on the map include BZS Bitlis–Zagros Suture, CAFZ Central Anatolian Fault Zone, DSTF Dead Sea Transform Fault, EAF East Anatolian Fault, EFZ Ezinepazari Fault Zone, KTF Kefalonia Transform Fault, MAF Movri–Amaliada Fault, MOF Malatya–Ovacik Fault, NAF North Anatolian Fault, NAT North Anatolian Trough, and TIP Turkish–Iranian Plateau

As one of the areas with the highest intensity and frequency of seismic activity and the most severe seismic and geological hazards on the Turkish mainland, numerous strong earthquakes of magnitude ≥ 6 have been observed in western Türkiye over the past ten years, such as the 2014–M 6.9, Gökçeada; 2017–M 6.2 Karaburun; 2017–M 6.4 Kos; 2019–M 6.0, Denizli; and 2020–M 7.0 Samos earthquakes. Large–magnitude earthquakes have also occurred in the study area since 1900, e.g. 1919–Soma M 6.9, 1928–Torbalı M 6.3, 1933–Gökova M 6.8, 1956–Söke–Balat M 7.1, 1969–Alaşehir M 6.5, and 1970–Gediz M 7.2 earthquakes. Spatially distributed, relatively frequent, moderate magnitude–crustal earthquakes are concentrated in the western Anatolia graben systems, where normal faulting mechanisms are associated with extensional tectonics (Duman et al. 2018). The seismic pattern represents swarm–type activity with remarkable clusters (e.g. the Urla earthquakes swarm with M 5.0, 5.8, 5.5, and 5.9). Most of them were recorded as shallow and restricted to a seismogenic zone within the upper 30–40 km of the crust, except for the events that occurred in the southern Aegean area, which reached a depth of approximately 180 km (Bocchini et al. 2018). The bulk of the moderate–to–large magnitude events accumulated near the plate boundaries, which correspond to the northern and southern margins of the study area. Recent seismicity studies have identified that the dominant strike–slip stress regime is concentrated on the western branch of the NAFZ and its continuation in the Marmara Sea and the North Aegean Trough (Vamvakaris et al. 2016). This confirms the co–existence of active extension and strike–slip deformation along the coastal region of western Anatolia and the eastern Aegean Sea (Kiratzi et al. 2021). Moreover, the southern Aegean domain includes large thrust earthquakes on adjacent Hellenic and Cyprus Arc segments as well as a mixture of normal, strike–slip, and splay–thrust faulting events.

In this study, we separated the study area into smaller parts on the basis of criteria mainly related to the spatial distribution of earthquakes and similar types of fault zones identified in the region. Using the major seismotectonic domains proposed by Duman et al. (2018), we divided the study area into three smaller sections: (1) the western branch of the NAFZ as Area 1 (A1), (2) the western Anatolia graben systems as Area 2 (A2), and (3) the Aegean arc as Area 3 (A3). Thus, we only selected data with epicentral distances less than 200 km to avoid additional path complications.

3 Data

We compiled three–component strong–ground motion records obtained with 100 samples per second at accelerometer stations within the Turkish National Strong Ground Motion Observation Network operated by the Disaster and Emergency Management Presidency (AFAD). Raw versions of the data are available on the Turkish Accelerometric Database and Analysis System (TADAS, via https://tadas.afad.gov.tr). Records were selected on the basis of stations with known VS30 values, including Aydın, Denizli, Izmir, Kutahya, Manisa, and Mugla in the Aegean region and Balikesir and Canakkale provinces in western Türkiye (Fig. 2). The station details for all sites are listed in Supporting Information Tables S1, S2, and S3, respectively. The multichannel analysis of surface waves (MASW) method was used to calculate station velocity profiles and mean VS30 values (Sandikkaya et al. 2010). Further details on the site characterization of these stations are available on the Strong Ground Motion Database of Türkiye (via https://deprem.afad.gov.tr). Because earthquake magnitudes have been reported on several scales, including the duration, local, and moment magnitude scales, we converted all of them to Mw, applying the conversion equation of Kadirioğlu and Kartal (2016) for consistency.

Fig. 2
figure 2

Locations of the stations (triangles) and the distribution of earthquakes (circles) were used in this study. The yellow triangle–purple circle, black triangle–red circle, and red triangle–yellow circle pairs correspond to the strong–ground motion station and earthquake couples used in κ calculations for A1, A2, and A3. The blue star represents the epicenter of the Mw 7.0 Samos earthquake that occurred on October 30, 2020. The blue beachball displays the focal mechanism solution provided by USGS

Figure 3a displays pairwise relationships among all selected independent variables, including epicentral distance (Repi), depth, magnitude (Mw), and κ for whole data set. Our dataset consists of 1398 records (4194 components) measured at 58 strong–ground motion stations from 408 earthquakes recorded between July 2003 and November 2018. The data cover earthquakes with Mw ranging from 3 to 6.5. Most of the records relate to events within Mw range of 3.5–4.5, while the remaining data are evenly distributed to larger earthquakes. Moreover, a few earthquakes with Mw over 5.5 are also included in the dataset. The epicentral distances range from 5.0 to 200 km, with most being between 10 and 100 km. The number of records decreases as Repi increases. The strong–motion data cover various focal depths, from 1 to 78 km. Additional information regarding the earthquakes can be found in the Supporting Information (see Tables S4, S5, and S6 for each area, respectively).

Fig. 3
figure 3figure 3

Pairwise correlations among multiple variables including Repi, depth, Mw, and κ in the following: a whole data set, b A1, c A2, and d A3. Each diagonal also includes histograms that demonstrate the number of records within the independent variable bins

Because the study area is divided into three smaller parts, all computations have been separately performed for these areas. Statistical distributions of recordings with respect to Repi, Mw, focal depths, and κ can be seen in Fig. 3b, c, and d for the A1, A2, and A3 areas. These figures also include histograms showing the number of records within the independent variable bins. The size of the datasets is approximately the same for A1 and A3, whereas it is almost six times larger for A2. It can be easily seen that the distance distribution is reasonably homogeneous for the three regions. The selected data subsets indicate that there is no obvious trend with either Repi, depth, or Mw, indicating that no bias is included during the κ calculations. The bulk of the data, which substantially originates from the central–western part of Türkiye, consists of small—to moderate–magnitude events (up to M < 5.5) and shallow depths within the range of 1–30 km, whereas significant events occurring at depths greater than 30 km originating from A3 are recorded.

We determine κ values using the original method proposed by Anderson & Hough (1984), where the high–frequency spectral decay is modeled as follows (e.g. Purvance and Anderson 2003; Castro et al. 2022; Lanzano et al. 2022):

$$A\left(f\right)={A}_{0}{e}^{-\pi \kappa f}, f>{f}_{e}$$
(1)

where the amplitude A0 depends on the source and path properties and fe is the frequency above which the spectrum is approximated with a linear decay on a log (Amplitude) versus linear frequency plot.

Previous studies have mainly focused on the examination of horizontal κ values. However, it is critical to investigate the variability in κ values obtained from all three components of a single record for certain purposes. Stochastic ground motion simulations may require vertical κ values to adjust the empirical horizontal–to–vertical curves used for site amplifications (Motazedian 2006). When three–component stations are not available, κ values obtained from the vertical components can be used as an initial estimate for this parameter. However, it may be necessary to make slight adjustments to the κ values derived from the vertical component (Douglas et al. 2010). In this study, the North–South (NS), East–West (EW), and vertical (Z) components for each triaxial acceleration record are processed individually. We initially corrected the baseline for each time series by subtracting the linear trend. The lengths of the S–wave and noise windows, which vary between 3 and  40 s depending on the Mw and source–to–site distances, are manually decided for each record for precision and consistency. Following 5% cosine tapering at both edges of the time window, the Fourier amplitude spectrum (FAS) of the S–wave (and noise signal) is computed and plotted in log–linear space after smoothing between 0.3 and 35 Hz. Then, κ factor is computed from the FAS by fitting a straight line with least–squares linear regression where the spectral decay starts (fe) and ends (fx). We note that the lower and upper frequency bounds for the exponential decay are determined manually. While selecting these frequencies, the method requires satisfying the following conditions to avoid bias in κ estimates: the frequency range chosen (1) must lie above the corner frequency (fc) due to the relation between seismic moment and fc, (2) should supply adequate bandwidth to ensure robustness to the regressions and to resolve source, path, and site effects where the instrument’s response is considered flat (e.g. Ktenidou et al. 2017), and (3) should not exceed the high–frequency noise level (e.g. Douglas et al. 2010; Ktenidou et al. 2013).

On the basis of these criteria, we rejected events with Mw lower than 3.5 to avoid biased–κ estimations caused by the source effect on the measurements. We calculated empirical fc values of Mw varying from 3.5 to 6.5 using the theoretical relationship among seismic moment, stress drop, and corner frequency described by Brune (1970). For Mw 3.5, we obtained typical estimates of fc ranging from 1.36 to 6.32 Hz with an average shear–wave velocity in the crust of 3.5 km/s and variable stress drop values between 1 and 100 bars. Thus, fe points greater than 6.5 Hz were selected for the events of Mw between 3.5 and 4 to limit the impact of fc. Further details on the calculations can be found in Ktenidou et al. (2017). We also ensure that fe > fc on the displacement amplitude spectra in the log–log scale. Thus, we ensure fe exceeds fc for all considered events. Several studies have shown that resonance peaks in site transfer functions may cause inaccurate selection of the initial frequency value or high–frequency site amplification, leading to significant overestimation or underestimation κ (Parolai and Bindi 2004; Kishida et al. 2014). Therefore, empirical transfer functions estimated through the horizontal–to–vertical spectral ratio for the stations considered here are studied to confirm that the fundamental frequencies are consistently below fe, where the exponential decay starts. In selecting fx, we analyzed the signal–to–noise ratios (SNR) and used only data with SNR above 3. A value of fx equal to 35 Hz (70% of Nyquist frequency as a limit) was used to guarantee an entire frequency range in which the instrument has a flat response to ground acceleration. Furthermore, we only considered ground motions within a Repi of 200 km to minimize the potential for capturing multiple ray paths because of the structural complexity of the region (Kurtulmus and Akyol 2015). Finally, we selected stations with at least 15 or more suitable ground motion records per station.

The average value of fe is around 10 Hz for both horizontal components and 12 Hz for the vertical one, but it has a wide range between 3 and 25 Hz for all components, which is consistent with previous studies such as (Anderson and Hough 1984; Douglas et al. 2010). Here, we observe that fe is smaller for large earthquakes than for moderate and small earthquakes, and this could be related to a seismic moment dependency as discussed by Tsurugi et al. (2020). The value of fx was found to vary between 20 and 35 Hz. Figure 4 displays the sample κ calculation, including the S–wave and noise spectra. The κ values eventually determined from the NS and EW components are then averaged to yield a single horizontal κ value per event and station to reduce the influence of propagation heterogeneity. Askan et al. (2014) stated that such an averaging assumes that the direction of the incoming waves does not affect the κ value. To improve our analyses, we excluded individual κ calculations with a difference greater than 25% between the 2 horizontal components (e.g. van Houtte et al. 2011; Ktenidou et al. 2013).

Fig. 4
figure 4

Data from an Mw 5.1 earthquake on May 27, 2017, at 15:53:23. Acceleration time–series and their corresponding spectra from station 3509. a Three–component records of S–wave and noise signals are highlighted and labeled in the time–series plot. b Examples of the FAS of the S–window (the upper jagged line) and the noise signal in gray (lower jagged line) for each component. Vertical dashed lines represent the picking of fe, the lower bound, and fx, the upper bound on the FAS. The black line represents the fit to the spectrum over the frequency band for the κ measurement

4 Methodology

4.1 Multiple linear regression (MLR) model

We employed a regression model to determine the correlation between κ estimates and independent variables, such as Mw, Repi, and Vs30. This approach is widely accepted and frequently used in data–driven mathematical analysis, although it has several variations. The MLR model is extensively preferred in many applications because of its well–established form and available computer packages. It is generally used to construct a relationship between selected independent (explanatory) variables and a dependent (response) variable. The standard MLR form (Neter et al. 1996), with N observations and p explanatory variables (x1, x2, …, xp), is given as follows:

$${y}_{i}={\beta }_{0}+{\beta }_{1}{x}_{i1}+{\beta }_{2}{x}_{i2}+\dots +{\beta }_{p}{x}_{ip}+{\varepsilon }_{i} , \left(i=1, 2, \dots ,N\right)$$
(2)

The matrix–vector notation of this model can also be expressed as follows:

$$\left[\begin{array}{c}{y}_{1}\\ {y}_{2}\\ \begin{array}{c}\vdots \\ {y}_{N}\end{array}\end{array}\right]=\left[\begin{array}{ccc}\begin{array}{ccc}1& {x}_{11}& {x}_{12}\\ 1& {x}_{21}& {x}_{22}\end{array}& \cdots & \begin{array}{c}{x}_{1p}\\ {x}_{2p}\end{array}\\ \vdots & \ddots & \vdots \\ \begin{array}{ccc}1& {x}_{N1}& {x}_{N1}\end{array}& \cdots & {x}_{Np}\end{array}\right]\left[\begin{array}{c}{\beta }_{0}\\ {\beta }_{1}\\ \begin{array}{c}\vdots \\ {\beta }_{p}\end{array}\end{array}\right]+\left[\begin{array}{c}{\varepsilon }_{1}\\ {\varepsilon }_{2}\\ \begin{array}{c}\vdots \\ {\varepsilon }_{N}\end{array}\end{array}\right]$$
(3)

where

$${\varvec{y}}=\left[\begin{array}{c}{y}_{1}\\ {y}_{2}\\ \begin{array}{c}\vdots \\ {y}_{N}\end{array}\end{array}\right], {\varvec{X}}=\left[\begin{array}{ccc}\begin{array}{ccc}1& {x}_{11}& {x}_{12}\\ 1& {x}_{21}& {x}_{22}\end{array}& \cdots & \begin{array}{c}{x}_{1p}\\ {x}_{2p}\end{array}\\ \vdots & \ddots & \vdots \\ \begin{array}{ccc}1& {x}_{N1}& {x}_{N1}\end{array}& \cdots & {x}_{Np}\end{array}\right],{\varvec{\beta}}=\left[\begin{array}{c}{\beta }_{0}\\ {\beta }_{1}\\ \begin{array}{c}\vdots \\ {\beta }_{p}\end{array}\end{array}\right],{\varvec{\varepsilon}}=\left[\begin{array}{c}{\varepsilon }_{1}\\ {\varepsilon }_{2}\\ \begin{array}{c}\vdots \\ {\varepsilon }_{N}\end{array}\end{array}\right]$$
(4)

In short:

$${\varvec{y}}={\varvec{X}}{\varvec{\beta}}+{\varvec{\varepsilon}}$$
(5)

where y is an (N × 1) vector of the response variable, X is an (N × (p + 1)) matrix of explanatory variables, β is a ((p + 1) × 1) vector of unknown parameters, and ε is an (N × 1) vector of independent, identically distributed random errors. In this approach, to estimate unknown parameters, the least squares estimation method is used.

4.2 Multivariate adaptive regression splines (MARS) model

The commonly used MLR requires certain assumptions to be fulfilled. Therefore, its prediction power is restricted to the estimation of general functions of high–dimensional and sparse data. In this study, we used a sophisticated nonparametric regression method called MARS, which makes no specific assumption about the underlying functional relationship between the response and input variables (Friedman 1991). The MARS model is represented by a linear combination of the basis functions (BFs) and the intercept as follows:

$$y={\theta }_{0}+\sum_{m=1}^{M}{\theta }_{m}{B}_{m}\left({x}^{m}\right)+\varepsilon$$
(6)

where \({B}_{m}\hspace{0.33em}(m=\mathrm{1,2},...,M)\) are BFs taken from a set of M linearly independent basis elements, and \({\theta }_{m}\) are the unknown coefficients for the mth BF (\(m=\mathrm{1,2},...,M\)). BFs can be in the form of the main or interaction of two or more spline functions. The following piecewise linear functions given in Eq. (7) that involve one independent variable with a knot value \(\tau\) show main effects and are reflected pairs of each other:

$${c}^{+}\left(x,\tau \right)=max\left\{0,x-\tau \right\}, {c}^{-}\left(x,\tau \right)=max\left\{0,\tau -x\right\}$$
(7)

Figure 5 shows an illustration of these BFs with a knot value at τ = 3.

Fig. 5
figure 5

Pair of main BFs with a knot at τ = 3

Because the MARS algorithm is an adaptive procedure, the selection of BFs is data–based and specific to the problem at hand. A special advantage of MARS is that it is also possible to estimate the response by the contributions of the BFs with the interaction effects of the independent variables. For a given data \(\left( {\overline{x}_{i} ,\;\overline{y}_{i} } \right)\;\;\left( {i = 1,2,...,N} \right)\), the form of the mth BF with the interaction effect is as follows:

$${B}_{m}\left({x}^{m}\right):=\prod_{j=1}^{{K}_{m}}{\left[{S}_{{\kappa }_{j}^{m}}.\left({x}_{{\kappa }_{j}^{m}}-{\tau }_{{\kappa }_{j}^{m}}\right)\right]}_{+}$$
(8)

where \([q{]}_{+}:=max\left\{0,q\right\},\) \({K}_{m}\) is the number of truncated piecewise linear functions multiplied in the mth BF, \({x}_{{\kappa }_{j}^{m}}\) is the input variable corresponding to the jth truncated piecewise linear function in the mth BF, \({\tau }_{{\kappa }_{j}^{m}}\) is the knot value corresponding to the variable \({x}_{{\kappa }_{j}^{m}}\), and \({s}_{{\kappa }_{j}^{m}}\) is the selected sign + 1 or − 1. Figure 6 graphically represents an example of interaction BF obtained based on two predictor variables (x1 and x2) and their knot values (\({\tau }_{1}=0.5\) and \({\tau }_{2}=0.1\)) respectively.

Fig. 6
figure 6

Graphical representation of the interaction basis function based on   x1 and  x2 predictor variables

The smoothness constraints in MARS are implemented using BFs to which a piecewise linear function is linked, as defined in Eq. (7). This function governs the contribution of a specific basis function over different parts of the input space. Smoothness constraints are inherent properties of these functions. The following factors affect the level of smoothness in the MARS–generated κ functions:

  • Number of Basis Functions: increasing the number of basis functions allows the model to capture complex features and non–linear patterns in the data, but it may result in a less smooth overall fit.

  • Number of Knots: an insufficient number of knots may result in an overly simplistic model, whereas an excessive number of knots might lead to a less smooth function.

  • Interaction Terms: MARS allows the use of interaction terms, but it is important to note that multiple interactions may result in a less smooth response function.

  • In the backward step of the MARS algorithm, the degree of pruning affects the overall smoothness of the response function.

The MARS algorithm consists of two parts (Friedman 1991): the forward stepwise algorithm searches for the BF, and at each step, the split that minimizes the residual sum of squares (RSS) criterion from all possible splits on each BF is chosen. The search for new BFs is restricted by the user–specified value Mmax and the maximum interaction order. In the backward stepwise algorithm, on the other hand, the purpose is to prevent over–fitting by decreasing the complexity of the model without degrading the fit to the data by removing some BFs that contribute less to the model (i.e. the smallest increase in the RSS). MARS algorithm uses a generalized cross–validation (GCV) criterion to estimate an optimal number of parameters in the model. GCV is used in both parts of the MARS algorithm and is defined as:

$$GCV=\frac{1}{N}\frac{{\sum }_{i=1}^{N}{\left({y}_{i}-{\widehat{f}}_{M}\left(\theta ,{x}_{i}\right)\right)}^{2}}{{\left(1-Q\left(M\right)/N\right)}^{2}}$$
(9)

where the numerator is the RSS and Q(M) in the denominator represents the cost penalty measure of a model with M BFs. The final MARS model is obtained when the minimum value of the GCV is reached.

Summaries of the advantages of the MLR and MARS models used in this study are as follows: MLR is a statistical model used to analyze the relationship between a dependent variable and multiple independent variables. This model is simple and easy to understand. Compared with more complex models, linear regression models have better training and forecasting speeds. However, it is crucial to note that MLR may not be suitable for complex relationships or situations that involve significant non–linearities. Conversely, MARS excels at capturing non–linear correlations in the data. Unlike some complex machine learning models, it provides an interpretable and transparent representation of the relationship between the predictors and the response variable. The explicit functional form, which includes linear segments and breakpoints, allows for a precise understanding of the model’s behavior. During model construction, the MARS algorithm identifies the most significant variables and their relationships while disregarding the less impactful ones. This can result in models that are more comfortable and easier to understand, and better generalizations to new data.

5 Application and results

The findings of the MLR and MARS models for predicting κ values in western Türkiye were analyzed to validate the study. The majority of research in the existing literature illustrates the correlation between κ and distance. In this area, where most earthquakes occur in the shallow crust, various earthquake distance measurements, such as epicentral and hypocentral distances, do not have a significant impact. We experimented with several statistical metrics and observed that the outcomes did not differ substantially. Hence, we have chosen to use Repi as the main distance metric in the models. A limited number of studies argue that magnitude could influence κ as well, implying the existence of source–related effects on high–frequency spectral amplitudes (e.g. Purvance and Anderson 2003). To investigate the potential influence of magnitude, this study employs the MLR and MARS methodologies, where Mw is considered an independent variable in the κ model predictions. Moreover, the Vs30 value, denoting the time–averaged shear wave velocity within the initial 30 m of depth, is the most common proxy for site classification globally. Therefore, in this study, Repi, Mw, and Vs30 are input variables included in the evaluation process. Tables 1, 2, 3 present the descriptive statistics for all variables related to each area. The tables show that the variables have a varied range of values. Hence, all input data are normalized so that each has a mean of zero and a variance of one to overcome variable scale impacts on the findings while performing MLR and MARS modeling. Some well–known statistical performance indicators, such as the correlation coefficient (r), R–square, adjusted R–squared (Adj. R2), mean absolute percentage error (MAPE), and mean squared error (MSE), were generated to assess the effectiveness and accuracy of the models. The following section compares the subsets based on the model metrics.

Table 1 Descriptive statistics of variables for A1
Table 2 Descriptive statistics of variables for A2
Table 3 Descriptive statistics of variables for A3

After the MLR and MARS prediction models were built for A1, some measures for the prediction performance of the models were computed, as listed in Table 4. The following results show that the MARS model performs quite well for the given dataset. Here, the smaller values of MSE and MAE and the higher values of R2, Adj.R2, and r show better performance of the associated model. Moreover, scatter plots of the prediction models for both the MLR and MARS models are shown in Fig. 7 shows the performances by using the correlation between predicted and actual values. This figure indicates that the MARS model results for both the horizontal and vertical components have a better prediction of true values. The BFs in the MARS model use all predictor variables in the following order of importance: Repi, Vs30, and Mw.

Table 4 Comparison of performance measures for MLR and MARS in A1
Fig. 7
figure 7

Comparison of MLR and MARS predictions for A1. All plots show actual κ vs predicted values. Circles and squares correspond to the horizontal and vertical components, respectively. Blue lines represent the case when predicted κ equals to actual κ

If we compare the MARS and MLR performance results for A2, the prediction capability of the MARS model is again stronger and the model captures the main structure of the given dataset well (see Table 5 and Fig. 8). The variable importance of the MARS model obtained according to the use of variables in the model is Repi, Vs30, and Mw in order.

Table 5 Comparison of performance measures for MLR and MARS in A2
Fig. 8
figure 8

Comparison of MLR and MARS predictions for A2. All plots show actual κ vs predicted model values. Circles and squares correspond to the horizontal and vertical components, respectively. Blue lines represent the case when predicted κ equals to actual κ

Similar results were obtained for A3. Although the form and size of the datasets are changed, the MARS model performance is much better than that of MLR because it can produce a model suitable for the data structure (see Table 6 and Fig. 9). The BFs in the MARS model use all predictor variables in the following order of importance: Repi, Vs30, and Mw.

Table 6 Comparison of performance measures for MLR and MARS in A3
Fig. 9
figure 9

Comparison of MLR and MARS predictions for A3. All plots show actual κ vs predicted model values. Circles and squares correspond to the horizontal and vertical components, respectively. Blue lines represent the case when predicted κ equals to actual κ

Although MLR and MARS are both regression techniques used for prediction tasks in this study, the results show that they have distinct characteristics and prediction performances. MARS can handle both linear and nonlinear relationships with the help of piecewise linear functions. It also adaptively models different segments of the data, making it more flexible in capturing complex nonlinear patterns. Moreover, MARS automatically captures interactions between variables by creating line segments that allow for varying relationships between different subsets of the data. Thus, MARS is likely to provide better prediction performance because of its ability to adaptively capture such relationships. For all these reasons, in the next part of the application, detailed modeling of the dataset is performed using the MARS approach.

In this part of the application, all of the areas are divided into 3 groups according to the Mw as follows: the first group includes Mw values between 3.5 and 4, the second group includes Mw values between 4 and 5, and the third group includes Mw values between 5 and 6.5. In each area, the number of observations and Mw values are different; thus, the corresponding number of observations differs according to each Mw bin. For example, for A1, the total number of observations is 166. The first group contains 38 observations, accounting for 22.89% of the total observations. The second group includes 89 observations, accounting for 53.61% of the total observations. On the other hand, 39 observations fall in the third group, accounting for 23.50% of the total observations. Table 7 presents the performance outcomes of the MARS models for A1. The findings demonstrate the highest performance values for group Mw = 4–5, particularly for horizontal κ. However, they exhibit the highest performance values for vertical κ in the Mw = 3.5–4 group. The data are more abundant within that particular range of Mw. Machine learning algorithms frequently exhibit outstanding performance when used on extensive datasets characterized by irregular structures, such as κ datasets, because of their ability to extract information directly from the data. Therefore, we attribute the algorithm’s achievement in this particular category to the greater volume of data. Overall, the findings suggest that MARS models may attain optimal performance independent of the data structure and Mw. In other words, MARS models tend to be more successful when used with Mw groups that have well–structured data dispersion.

Table 7 Performance results of the MARS models according to different Mw bins for A1

For A2, the total number of observations is 1056, and the first bin contains 310 observations, accounting for 29.36% of the total observations. The second bin includes 614 observations, accounting for 58.14% of the total observations. On the other hand, 132 observations fall in the third bin, accounting for 12.50% of the total observations. In the area with the most data, the MARS model demonstrates superior outcomes for the largest–sized earthquakes when considering horizontal κ. Conversely, its effectiveness is more pronounced for the medium–sized earthquake group when considering vertical κ (see Table 8).

Table 8 Performance results of the MARS models according to different Mw bins for A2

For A3, the total number of observations is 176, and the first bin contains 69 observations, accounting for 39.20% of the total observations. The second bin includes 88 observations that account for 50.00% of the total observations. On the other hand, 19 observations fall in the third bin, accounting for 10.80% of the total observations. Based on the results in Table 9, it can be observed that most model performances were more significant for the group Mw = 4–5. However, it is noteworthy that the vertical κ model achieved the most favorable results in the Mw = 4–5 category for all performance metrics.

Table 9 Performance results of the MARS models according to different Mw bins for A3

Finally, the entire dataset, which includes all areas, is randomly split into train (80% of all data) and test (20% of all data) datasets, and the MARS model is obtained on the train dataset. The best MARS model obtained for the horizontal component is given in Table S7, and the highest degree of interactions between the variables and the maximum number of BFs are defined as 3 and 500, respectively. The best MARS model obtained for the vertical component is given in Table S8, and the highest degree of interactions between the variables and the maximum number of BFs are defined as 2 and 500, respectively. The results in Table 10 indicate that the prediction capability of the MARS models built both for horizontal and vertical κ are excellent for train and test datasets for most of the performance measures, and the train models capture the main structure of the train data well. It is important to keep in mind that the evaluation of model performance is not solely based on performance measures. For this purpose, to show the prediction ability of the MARS model, three different scatter plots (in Figs. 10, 11, 12, 13) that show MARS model performance using (1) the correlation between the predicted and actual response values, (2) the relationship between fitted values and standardized residuals, and (3) the plot of standardized residuals versus observation order are obtained both for the train and test datasets. In a scatter plot representing the correlation between predicted and actual response values, the best result is typically achieved when the predicted values align closely with the actual values along a diagonal line. This diagonal line represents a perfect correlation, where the predicted values are exactly equal to the actual values. In the case of modeling horizontal κ, if the points on the scatter plot around the diagonal line are examined, the MARS model’s predictions are close to the true values, and there is a good relationship between the predicted and actual responses. Furthermore, the best result in the scatterplot occurs when the standardized residuals are randomly scattered around the horizontal line at 0 (the x–axis). In Figs. 10, 11, 12, 13, the second subplot indicates that the residuals have no systematic patterns or biases. In other words, the model effectively captures the underlying patterns in the data for both the training and testing datasets, and the errors are evenly distributed across the range of fitted values. Finally, a more efficient result is obtained for the scatter plot for standardized residuals versus observation order because the residuals are randomly scattered around the horizontal zero line, indicating a well–behaved and unbiased model that meets the assumptions of error independence. Considering the performance results of the MARS model obtained for vertical κ, the results obtained from the train data are generally good, but R2 and Adj. R2 values obtained for the test data are higher than the train data results. This shows that the MARS model obtained in vertical κ can be used for future prediction.

Table 10 Train and test data performance results of the MARS models for horizontal and vertical κ responses
Fig. 10
figure 10

Scatter plots of the prediction MARS model for the horizontal κ response on the train dataset

Fig. 11
figure 11

Scatter plots of the prediction MARS model for the horizontal κ response on the test dataset

Fig. 12
figure 12

Scatter plots of the prediction MARS model for the vertical κ response on the train dataset

Fig. 13
figure 13

Scatter plots of the prediction MARS model for the vertical κ response on the test dataset

6 Conclusions

In this study, classic MLR models and MARS–based predictions are developed for the high–frequency spectral attenuation parameter κ. Horizontal and vertical κ models were created for three areas within the Aegean Sea region of Türkiye. The selection of areas is contingent on the tectonic and geological characteristics present within the given area. The predictive capabilities of the methods were tested in terms of selected, well–known performance measures. Based on the numerical results of this study, the following conclusions are drawn:

  • In the MLR models, in terms of Mw, Repi, and VS30, the performance measures indicate poor predictive capability regardless of the area.

  • The MARS model, which employs piecewise continuous basis functions, demonstrates significantly enhanced predictive performance compared with the MLR model. This improvement can be attributed to the MARS model’s capacity to dynamically capture and represent the unique properties inherent in the dataset.

  • The performance of the MARS model is shown to be nearly equivalent across all areas, regardless of whether it is applied to horizontal or vertical components.

  • The findings indicate that optimal MARS performance may be achieved irrespective of the data structure and earthquake magnitude bins.

  • The MARS–based prediction equations developed in this study may be used to perform more precise and reliable κ estimations in the Aegean Sea region. The κ models developed here can be used in future studies concerned with stochastic ground motion simulations, site characterization, and ground motion models in the Aegean Sea region of Türkiye.

  • Finally, the findings of this study indicate that MARS–based models are promising for capturing the complex characteristics of κ datasets globally.