1 Introduction

Seismicity-induced liquefaction is one disastrous type of geohazards, because it may trigger ground failures, such as slope failure, damage to pile foundation, and structural instability of buildings, thus leading to fatal consequences (Huang & Miao, 2013). Soil liquefaction generally occurs when soil masses substantially lose their strength and stiffness due to gradually increased pore water pressures and reduced effective stresses (Jefferies & Been, 2019), which will result in a case that the solid soil behaves temporarily like a kind of viscous liquid. Based on disparities in mechanism of deformation, soil liquefaction can be divided into flow liquefaction and cyclic mobility (Steven, 1996). For the former, soil will experience large-scale deformations both under static and cyclic loading; and for the latter, soil will experience deformations incrementally, maybe eventually causing large-scale permanent deformations termed “lateral spreading”. Largely due to the complexity, variability and uncertainty of soil structures, soil liquefaction remains a difficult but hot research topic in the fields of soil mechanics and geotechnical engineering (Bao et al., 2019).

To assess soil liquefaction potentials, various methods have been developed over the past few decades, including field methods (Leslie Youd & Idriss, 1996), laboratory methods (Monkul et al., 2015), analytical methods (Youd et al., 1999), numerical methods (Alavi & Gandomi, 2012; Galavi et al., 2013), Geographic Information System (GIS) methods (Sumedh Mhaske & Choudhury, 2010), artificial intelligence models ( Hoang & Bui, 2018; Chen et al., 2020; Samui & Sitharam, 2011) and the like. Accurately planned and executed laboratory methods are often time-consuming and costly, so they are normally used in high-cost and high-risk cases. Normally, site investigations and laboratory tests are combined to first obtain soil parameters, which are then help build numerical models. Even so, it is a tough task to obtain the representative and “undisturbed” test samples from in-situ deposits, and human and instrumental errors may occur, thus yielding less reliable results (Goharzay et al., 2017). Traditional analytical methods are only applicable in simplified cases, and they may fail in practical soil liquefaction assessments. As to numerical simulation, the potential difficulties include the anisotropy and non-uniformity nature of soils, which would impede the development of an accurate cyclic elasto-viscoplastic constitutive model, due to uncertainties involved in the soil environment and seismic conditions.

Thanks to ever strong computing power, explosive date collection ability and flexible open-source software platforms (Buitinck et al., 2013; Pedregosa et al., 2011), machine learning (ML) technologie can now serve as novel and powerful tools in soil liquefaction prediction, including Support Vector Machines (SVMs) (Steinwart & Christmann, 2008), Bayesian Ridge and Kernel Ridge regression based models (Cristianini & Shawe-Taylor, 2000; Zhuang & Zhou, 2019), tree-based models (Clark & Pregibon, 2017), etc. More importantly, a multitude of datasets have been accumulated from in-situ experiments, laboratories and historic cases over the past decades. Machine learning models can help uncover complicated hidden patterns behind these raw real-world datasets and assess soil liquefaction potentials (Pal, 2006; Zhou et al., 2019). Predicting earthquake-induced soil liquefaction is basically a process of supervised learning. In fact, a wide variety of classical machine learning models have been applied in the soil liquefaction classification, such as artificial neural networks (ANNs), support vector machines (SVMs) (Hoang & Bui, 2018; Samui & Sitharam, 2011), decision trees (DTs) (Gandomi et al., 2013), extreme learning machines (ELMs) (Samui et al., 2016), and adaptive neuro fuzzy inference system (ANFIS) (Xue & Yang, 2013).

In terms of robustness, flexibility and generality of machine learning models implemented in soil liquefaction assessment, recent progresses can be summarised into the following three aspects: (1) they are evolving from shallow to deep architectures; (2) they are combined with heuristic algorithms; and (3) they resort to ensemble methods. For shifting from shallow to deep architectures, many efforts have been made by using the state-of-art deep learning strategies (Liang and Srikant, 2016). As deep architectures can automatically extract hierarchies of features in a layer-by-layer manner, they are suitable for large or complex raw datasets. Kumar et al. (2021) applied a deep learning model to classify the soil tendency for liquefaction based on a dataset acquired from CPT tests in Chi-Chi earthquake, finding that this model outperformed the emotional back-propagation neural network (EmBP). Zhang et al. (2021) applied a deep neural network (DNN) to predict soil liquefaction parameters based on shear wave velocity. In addition, the combination of many heuristic algorithms has been explored, such as grey wolf optimization (GWO), particle swarm optimization (PSO), genetic algorithm (GA) and other classical supervised learning methods (Cai et al., 2021; Goharzay et al., 2020; Rahbarzare & Azadi, 2019; Zhou et al., 2021). More recently, ensemble learning has been applied in predicting earthquake-induced liquefaction, with the prediction results showing that the performance of model predictive capabilities can be enhanced compared with single machine learning models (Alobaidi et al., 2019; Zhang & Wang, 2021). However, the existing studies either focused on certain one dataset source or followed one certain ensemble method, thus unavoidably time-consuming. The information and knowledge of operation on one dataset was not collected and transferred to operation on other related data sources, and therefore different models remain isolated. To build a machine learning model capable of handling multi-source datasets and multiple tasks, this study will investigate a more generalised approach based on the state-of-the-art deep feature extraction, deep Autoencoder and other domain adaption techniques.

In this study, a unified framework based on Deep Fisher Discriminant Analysis (DDA) (Dίaz-Vico & Omari, 2017) and Gaussian Process (Banerjee et al., 2013) will be first proposed and examined, and it will be deployed as follows: firstly, as a deep feature extraction algorithm, DDA is used to extract a subset of features from a data set. This subset of features is then merged with the original data set. Secondly, Gaussian Process is deployed for binary classification, eventually resulting in a deeper machine learning architecture, DDA-GP. Additionally, this study formulates a domain adaption method for multi-source transfer learning with a deep Autoencoder for feature encoding and data reconstruction, and an instance-based boosting algorithm is used in building up the learning system. The performance and generality of the proposed models are investigated and compared with other state-of-the-art machine learning models on multiple datasets, which cover those widely used in-situ tests and use the repeated stratified K-fold cross validation. Also, the convergence graphs and the evaluation metrics for the deep Autoencoder and three instance-based transfer learning methods are compared and explained.

2 Description of multi-source datasets

2.1 Data sources and feature candidates

Many soil liquefaction cases have been compiled and published by researchers, and they comprise data from different types of in-situ tests and historic records all over the world (Cetin et al., 2018; Jilei et al., 2021). To verify the performance of the classifiers in this study, five datasets based on the following features are employed: shear wave velocity (Vs), cone penetration test (CPT), standard penetration test (SPT), real cases of earthquake-induced soil liquefaction, and historical gravelly soil liquefaction. Firstly, a brief description is given to all the input features regarding the datasets.

Dataset 1 consists of 174 liquefaction and non-liquefaction cases based on shear wave velocities summarized by Hsein Juang and Chen (2000). This dataset contains 98 non-liquefaction fields and 76 liquefaction fields. Seven input factors are used as inputs, including the depth of the soil layer D (m), the total overburden stress \({\sigma }_{\mathrm{v}}\) (kPa), the effective overburden stress \({\sigma }_{\mathrm{v}}^{\mathrm{^{\prime}}}\) (kPa), the soil class (SC), the shear wave velocity \({V}_{\mathrm{s}}\) (m/s), the peak horizontal ground surface acceleration \({a}_{\mathrm{max}}\) (g) and the earthquake moment magnitude \({M}_{\mathrm{w}}\), so as to predict whether a case of liquefaction would occur. A detailed description of the soil class is seen in Hsein Juang and Chen (2000).

Dataset 2 contains 226 CPT-based historical cases summarized by Goh and Goh (2007), and is composed of 133 liquefied cases and 93 non-liquefied cases. Six features are used for training and prediction: the cone tip resistance \({q}_{\mathrm{c}}\) (MPa), the sleeve friction ratio \({R}_{\mathrm{f}}\) (%), the effective stress \({\sigma }_{\mathrm{v}}^{\mathrm{^{\prime}}}\) (kPa), the total stress \({\sigma }_{\mathrm{v}}\) (kPa), the maximum horizontal ground surface acceleration \({a}_{\mathrm{max}}\) (g), and the earthquake moment magnitude \({M}_{\mathrm{w}}\).

Dataset 3 is a SPT-based liquefaction assessment obtained by Kurnaz et al. (2019). This case belongs to the Chi-Chi earthquake in Taiwan (\({M}_{\mathrm{w}}\)=7.6). The input parameters used in Dataset 3 include: depth of soil specimen Z, SPT blow number \({N}_{60}^{1}\), percent finest content less than \(75 \mathrm{\mu m}\) F75, depth of groundwater table \({d}_{\mathrm{w}}\), total and effective overburden stresses \({\sigma }_{\mathrm{vo}}\) and \({\sigma }_{\mathrm{vo}}^{\mathrm{^{\prime}}}\), and maximum peak ground acceleration \({a}_{\mathrm{max}}\) (g).

Dataset 4 is compiled with 620 real earthquake cases, containing 12 features. It includes the following soil and seismic parameters: depth of the soil layer Z, corrected standard penetration blow number \({N}_{60}^{1}\), percent finest content less than \(75 \mathrm{\mu m}\) F75, depth of ground water table \({d}_{\mathrm{w}}\), effective overburden stresses \({\sigma }_{\mathrm{vo}}\), total overburden stresses \({\sigma }_{\mathrm{vo}}^{\mathrm{^{\prime}}}\), threshold acceleration \({a}_{\mathrm{t}}\), cyclic stress ratio CSR, shear wave velocity \({V}_{\mathrm{s}}\), internal friction angle of soil \({\rho }^{\mathrm{^{\prime}}}\), earthquake magnitude \({M}_{\mathrm{w}}\), and maximum horizontal acceleration at ground surface \({a}_{\mathrm{max}}\) (g). This set of data was recorded by Hanna et al. (2007) in two major earthquakes that occurred in China and Turkey in 1999, with 256 liquefaction fields and 364 non-liquefaction fields recorded, respectively.

Dataset 5 is yielded from a newly published raw-data brief paper (Jilei et al., 2021) with missing values. Hu summarized information on certain liquefaction cases of gravelly soil in 17 earthquakes and compiled 234 historical cases based on shear wave velocity tests and dynamic penetration tests, with 98 liquefied cases and 136 non-liquefied cases recorded. The original datasets consist of 18 features with various missing values, and theese18 features are: moment magnitude \({M}_{\mathrm{w}}\), epicentral distance R (km), bracketed duration \(t\) of the earthquake with peak ground acceleration (PGA) greater than 50 gallens, PGA, fines content (FC) (%), gravel content (GC) (%), average practical size \({D}_{50}\) (mm), corrected dynamic penetration test blow count \({N}_{120}^{\mathrm{^{\prime}}}\), corrected shear wave velocity \({V}_{\mathrm{s}1}\), effective overburden stresses \({\sigma }_{\mathrm{v}}^{\mathrm{^{\prime}}}\) (kPa), depth of groundwater table \({D}_{\mathrm{w}}\) (m), depth of gravelly soil deposit \({D}_{\mathrm{s}}\) (m), thickness of the impermeable capping layer \({H}_{\mathrm{n}}\) (m), thickness of the unsaturated zone between the groundwater table and capping layer \({D}_{\mathrm{n}}\) (m), magnitude scaling factor (MSF), shear stress reduction factor \({r}_{\mathrm{d}}\), cyclic stress ratio (CSR), and cyclic stress ratio adjusted for earthquake with \({M}_{\mathrm{w}}=7.5\) (\({\mathrm{CSR}}_{7.5}\)). However, the data need to be further cleaned for high quality; especially the feature space shall be reduced and a proper strategy be made to deal with the missing values.

2.2 Data imputation techniques

Incomplete datasets are commonly seen in geotechnical engineering, resulting in datasets with missing values. This issue remains to be solved, regardless of advancements in missing-data imputation techniques over the past years (Zhu et al., 2018). The impact of missing values depends on the data-missing mechanisms, which are generally categorised into three major classes: (1) Missing Completely at Random (MCAR); (2) Missing at Random (MAR); and (3) Missing Not at Random (MNAR), as detailed in Little and Rubin (2002). For Dataset 5, the missingness may be systematically different from the observed values, but it can be explained with other observed values. Discarding the missing values may bias the overall data, so data imputation is necessary. Before accounting and handling those data absences, this study delves into the missing data landscapes and patterns in Dataset 5. Missing no library offers a convenient way to visualise the missing data (Bilogur, 2018). As shown in Fig. 1(a), the nullity matrix displays the overall dispersion of missing data in Dataset 5 over all 18 features. Obviously, the missing values exist for the input features of R, t, FC, GC, \({D}_{50}\), \({H}_{\mathrm{n}}\) and \({D}_{\mathrm{n}}\), but they are not highly concentrated on one specific row of features or column of a certain feature. For deeper details, the nullity correlation heat map is depicted in Fig. 1(b), which reveals how strongly the presence or absence of one variable would affect others. In Fig. 1(b), the nullity correlation ranges from − 1 to 1. A negative correlation indicates that if a variable is absent, other variables tend to appear. On the other hand, a positive correlation indicates that if one variable is absent, other variables tend to be absent as well. 0 indicates no obvious nullity correlation. Features with no missing values are omitted in this graph. From Fig. 1(b), it can be observed that there is a strong positive nullity correlation among GC, \({D}_{50}\) and FC. Those boxes labelled as \(<1\) indicate that two variables are very close to being exactly positive, which requires special attention, since the two features are strongly nullity correlated.

Fig. 1
figure 1

Missing values’ patterns for Dataset 5

Another concern in this study is how to deal with the missing values. The direct way is to delete the rows containing missing values, but this will generally cause the loss of valuable information. Another option is to adopt imputation techniques. For non-time series datasets, the simplest way is to impute the missing values with a certain constant, such as mean, median or most frequent values. More advanced multivariate imputation algorithms use the k-nearest neighbour (KNN) based imputation techniques and the Multivariate Imputation by Chained Equations (MICE). Another methodology is the missing-indicator method (Huberman & Langholz, 1999), which specifies all missing data with certain fixed values while an extra indicator or new dummy variable is added to the original feature space to indicate the presence or absence of related values. To determine the best strategies suitable for handling the missing values in soil liquefaction Dataset 5, those imputation strategies are deployed and compared by using a Gaussian process (GP) classifier. Figure 2 shows the performance of GP on Dataset 5 when using different imputation algorithms, with no other data pre-processing approaches used.

Fig. 2
figure 2

Comparison of data imputation techniques on Dataset 5

From Fig. 2, it can be observed that the missing-indicator method yields the best performance. Thus, this method will be adopted in this application.

2.3 Data visualization and analysis

Histograms and correlation coefficients for all input features are demonstrated to explicit visualize the distribution and correlation coefficients of the features.

Figure 3 shows the pairwise scatter plot matrix for Dataset 1. The frequency distribution histograms and Pearson correlation coefficients are illustrated on the diagonal line and upper half of the matrix. The red lines on the histograms are the density curves, while the red stars on the upper half stand for the p-values. On lower half of the matrix is the linear regression that fits for the two variables. The marginal frequency distribution curves demonstrate that not all variables have Gaussian distribution. Some features are very large in the magnitudes and therefore further feature scaling techniques need to be deployed, such as soil class (SC). Moreover, it should be noted that the soil layer D (m), the total overburden stresses of \({\sigma }_{\mathrm{v}}\) and \({\sigma }_{\mathrm{v}}\) and the effective overburden stress \({\sigma }_{\mathrm{v}}^{\mathrm{^{\prime}}}\) are all strongly correlated. The shear wave velocity \({V}_{\mathrm{s}}\) is moderately correlated with D, \({\sigma }_{\mathrm{v}}\) and\({\sigma }_{\mathrm{v}}^{\mathrm{^{\prime}}}\). The p-values of the correlation for all possible pairs of variables are attached to the correlation coefficient panel, which corresponds to the significance levels of correlations. The red symbols (“***”, “**”, “*”, and “.”) denote p-value \(\le 0.001\) (highly significant), p-value \(\le 0.01\) (very significant), p-value \(\le 0.05\) (significant), and p-value \(>0.05\) (not significant), respectively. From the p-values, it can be observed that there exists statistically significant correlation between the soil layer D (m) and the total overburden stress\({\sigma }_{\mathrm{v}}\). Also, for other features with three stars, the statistical correlations between them are also highly significant.

Fig. 3
figure 3

Pairwise scatter plot matrix: histograms and correlation coefficients of all input features for Dataset 1

Figure 4 shows the pairwise scatter plot matrix for Dataset 2. The same configured plots and values are presented as described above. Although many features show a near normal distribution, there is a large variance in magnitudes for several features, such as the maximum horizontal ground surface acceleration \({a}_{\mathrm{max}}\), so specific feature scaling techniques need to be applied. The largest correlation coefficient appears between the effective stress and the total stress \({\sigma }_{\mathrm{v}}\).

Fig. 4
figure 4

Pairwise scatter plot matrix: histograms and correlation coefficients of all input features for Dataset 2

From Fig. 5, it can be observed that the marginal frequency distribution curves for most features do not showcase Gaussian or near Gaussian distributions, so it is necessary to make deep feature scaling. For input features, the depth of soil specimen Z and the total and effective overburden stresses of \({\sigma }_{\mathrm{vo}}\) and \({\sigma }_{\mathrm{vo}}^{\mathrm{^{\prime}}}\) are mutually correlated significantly.

Fig. 5
figure 5

Pairwise scatter plot matrix: histograms and correlation coefficients of all input features for Dataset 3

From Fig. 6, it can be observed that the original raw data are of poor quality, and the density curves of most features are not close to a Gaussian distribution. For all input features, the same conclusion as for Dataset 3 can be drawn that the depth of soil specimen Z and the total and effective overburden stresses of \({\sigma }_{\mathrm{vo}}\) and \({\sigma }_{\mathrm{vo}}^{\mathrm{^{\prime}}}\) are mutually correlated significantly. This finding matches the physical phenomenon that the total stress in soil would increase with the depth and unit weight, while the effective stress would combine the effects of total stress and the pore pressure that controls soil behaviour.

Fig. 6
figure 6

Pairwise scatter plot matrix: histograms and correlation coefficients of all input features for Dataset 4

Figure 7 depicts the pairwise scatter plot matrix for Dataset 5. The diagraphs along the matrix diagonal also show the data range and histogram for each variable. A large variance can be observed between magnitudes of feature values. In addition, the moment magnitude \({M}_{\mathrm{w}}\) is significantly correlated with many other features; specifically, it is significantly negatively correlated with the magnitude scaling factor (MSF). The same can be observed for epicentral distance R (km) and bracketed duration \(t\) of the earthquake with PGA greater than 50 gal (s).

Fig. 7
figure 7

Pairwise scatter plot matrix: histograms and correlation coefficients of all input features for Dataset 5

2.4 Feature selection

As shown in Fig. 7, the original data of Dataset 5 have 18 input features. To further improve the performance of the machine learning model, the feature selection process is performedto reduce the dimensionality of feature space, and a subset of relevant features are selected for model construction, so as to discard some irrelevant features in the minimal loss of information. Basically, the major types of the feature selection methods can be divided into three branches: filter methods, wrapper methods, and embedded methods (Chandrashekar & Sahin, 2014). In filter methods, the relevant features are selected based on the intrinsic properties of the features measured via univariate statistics, rather than based on the cross-validation performance of specific machine learning models. Filter methods usually run faster and work well for any machine learning algorithms with more generality than the other two methods. In this study, because multiple machine learning models will be compared, in order to make fair and general comparison, a two-step filter method as shown in Fig. 8 will be adopted to select out those non-redundant and relevant features.

Fig. 8
figure 8

The two-step feature selection method with filter techniques for Dataset 5

First, the Variance Threshold method is applied to calculate each feature variance \(\left.\mathrm{Var}\left({X}_{i}\right)=E\left[{\left({X}_{i}-E\left({X}_{i}\right)\right)}^{2}\right)\right]\) (Chandrashekar & Sahin, 2014) and remove those features with variances below a given threshold, based on the ground that low-variance features do not convey too much information and barely contribute to the prediction of the target values. Then, the features are removed by applying the high correlation filter, aided by other reference scores. Pearson correlation coefficient (Benesty et al., 2009) in Eq. 1 mainly measures the linear association between the two variables; it is adopted for removing multicollinearity between features.

$$r=\frac{\sum \left({x}_{i}-E(x)\right)\left({y}_{i}-E(y)\right)}{\sqrt{\sum {\left({x}_{i}-E(x)\right)}^{2}\sum {\left({y}_{i}-E(y)\right)}^{2}}}$$
(1)

In order to investigate the nonlinear relationship, this study further investigates the mutual information between those features. The mutual information is generally applied to quantitatively specify the common certainty between two random features and measure their non-linear relations and mutation information based on entropy. Therefore, it is an ideal measure of stochastic dependency, since it can reduce the uncertainty of a random variable once others are known (Gabrié et al., 2019). Mutation information \(I(X;Y)\) can be calculated by:

$$I(X;Y)=H(X)+H(Y)-H(X,Y)=H(X)-H(X\mid Y)$$
(2)

where marginal entropy \(H(X)\) measures the level of expected uncertainty in random variable X:

$$H(X)=-{\sum }_{{x}_{i}\in X} P\left(X={x}_{i}\right)\mathrm{log}\left(P\left(X={x}_{i}\right)\right)$$
(3)

Joint entropy \(H(X,Y)\) measures the uncertainty when \(X\) and \(Y\) are considered together:

$$H(X,Y)=-{\sum }_{{x}_{i}\in X} {\sum }_{{y}_{i}\in Y} P\left(X={x}_{i},Y={y}_{j}\right)\mathrm{log}\left(P\left(X={x}_{i},Y={y}_{j}\right)\right)$$
(4)

and conditional entropy, which is the amount of uncertainty remaining about \(X\) when \(Y\) is known

$$H(X\mid Y)=-{\sum }_{{x}_{i}\in X} {\sum }_{{y}_{i}\in Y} P\left(X={x}_{i},Y={y}_{j}\right)\mathrm{log}\left(P\left(X={x}_{i}\mid Y={y}_{j}\right)\right)$$
(5)

A high mutual information value indicates a large reduction of uncertainty; vice versa. If the mutual information \(I\left(\mathrm{X};\mathrm{Y}\right)\) is zero, it means that \(X\) and \(Y\) are strictly independent. Also, the ANOVA (Analysis of Variance) f-value and Chi-squared stats (Kuhn & Johnson, 2013) are supplemented for feature selection.

After variance filtering by using the Variance Threshold method, it can be found from Fig. 9(a) that there are still multiple features with high correlation coefficients. Those with a value of above 50% are summarised as \({\sigma }_{\mathrm{v}}^{\mathrm{^{\prime}}}\) and \({D}_{\mathrm{s}}\), \({V}_{\mathrm{s}1}\) and \({N}_{120}^{\mathrm{^{\prime}}}\), \({D}_{50}\) and GC, R and \({M}_{\mathrm{w}}\), and \({D}_{\mathrm{w}}\) and \({\sigma }_{\mathrm{v}}^{\mathrm{^{\prime}}}\). In Fig. 9(b)–(d), the bar charts of the feature importance scores based on ANOVA f-value tests, Mutual information and Chi-squared stats (Kuhn & Johnson, 2013) are sorted from the worse to the best. Recall \({\sigma }_{\mathrm{v}}^{\mathrm{^{\prime}}}\) and \({D}_{\mathrm{s}}\) are highly correlated, and \({D}_{\mathrm{s}}\) is a less important feature than \({\sigma }_{\mathrm{v}}^{\mathrm{^{\prime}}}\) in all importance rankings. Although \({V}_{\mathrm{s}1}\) and \({N}_{120}^{\mathrm{^{\prime}}}\) are highly correlated, in linear tests, \({V}_{\mathrm{s}1}\) is more important than \({N}_{120}^{\mathrm{^{\prime}}}\); however, for mutual information, \({N}_{120}^{\mathrm{^{\prime}}}\) is more important; then, those two are kept. \({D}_{50}\) is more important than GC; then, \({D}_{50}\) is kept. For the same reason, feature \({D}_{\mathrm{w}}\) is dropped.

Fig. 9
figure 9

Results obtained by using multiple filter methods for feature selection after low-variance filtering

2.5 Feature scaling

From Figs. 3, 4, 5, 6, 7, it can be observed that the range of magnitudes of those input feature varies broadly; and the use of Euclidian distances in such machine learning algorithms as decision trees, MLP and k-nearest neighbors (KNN) may not be appropriate. Therefore, the feature scaling method is adopted to normalize the scale of input features, serving as a preprocessing step prior to the model construction (Jahangiri & Rakha, 2015). In this application, the min–max scaling method (sklearn.preprocessing.MinMaxScaler) is selected to rescale inputs to a fixed range, usually [0, 1]. This bounded range has the advantage of smaller standard deviations, while suppressing the effect of outliers and improving the ML predictive capability with small datasets. The min–max scaling is typically done via the following equation:

$${X}_{\mathrm{SC}}=\frac{X-{X}_{\mathrm{min}}}{{X}_{\mathrm{max}}-{X}_{\mathrm{min}}}$$
(6)

3 Overall methodology

3.1 Deep Fisher discriminant analysis

Fisher’ linear discriminant analysis (FLDA) is a very widely used feature extraction technique. Because of its linear nature, however, FLDA can hardly extract certain nonlinear relationships among feature spaces and thus is limited in applications. Proposed by Diaz et al., Deep Fisher discriminant analysis (DDA) is a non-linear Deep Neural Network (DNN) extension of Fisher' Linear Discriminant Analysis (FLDA) (Dίaz-Vico & Omari, 2017). First, Least Squares Regression (LSR) needs to be introduced to establish a link with FLDA. Second, LSR problem can be brought directly into the DNN setting. A straightforward DNN extension of LDA can be expressed as follows (Dίaz-Vico & Omari, 2017):

  1. 1.

    Solve the LSR problem by using a training set \({D}_{\mathrm{tr}}\)

    $$\mathrm{min}\frac{1}{2}{\left|{Y}_{\mathrm{tr}}-f\left({X}_{\mathrm{tr}},\mathcal{W}\right)\right|}^{2}$$
    (7)

    where \({Y}_{\mathrm{tr}}\) indicates the previous training target matrix; the p-th row of the matrix \({f\left({X}_{\mathrm{tr}},\mathcal{W}\right)}_{\mathrm{p}}\) is determined by \({f\left({X}_{\mathrm{tr}},\mathcal{W}\right)}_{\mathrm{p}}=f\left({x}_{\mathrm{p}},\mathcal{W}\right)\), where \(f\left({x}_{\mathrm{p}},\mathcal{W}\right)\) indicates a smooth function defined by a deep neural network and \(\mathcal{W}\) indicates the overall weight set. An optimal DNN weight set \({\mathcal{W}}^{*}\) can be obtained through training.

  2. 2.

    Compute the DNN projections \({y}_{\mathrm{p}}=f\left({x}_{\mathrm{p}},{\mathcal{W}}^{*}\right)\) over \({X}_{\mathrm{tr}}\) and their class means \({\overline{y} }_{c}\).

  3. 3.

    Compute the DNN mapping \(y=f\left(x,{\mathcal{W}}^{*}\right)\) of the testing samples \({x\in D}_{tr}\) and score the performance according to the distances between the DNN outputs and the class means \({\overline{y} }_{\mathrm{c}}\). And then, assign the highest score to the class.

Set \({\mathcal{W}}^{*}=\left({w}_{0}^{*},{w}^{*},{\mathcal{W}}^{*}\right)\), where \({w}_{0}^{*}\) and \({w}^{*}\) are the linear output weights. The optimal \({w}_{0}^{*}\) and \({w}^{*}\) are used to solve the LSR problem over the last hidden layer feature \(z=\Phi \left(x,{\mathcal{W}}^{*}\right)\), where \(\Phi\) indicates the mapping defined by the deep neural networks. And the nonlinearity in DNN is introduced by the user-defined activation function. In this way, the class-mean classifier of the full DNN is comparable to the class-mean on FLDA projection of the z patterns at the last hidden layer, and these class-mean classifiers are also learned by adjusting the overall weight \(\mathcal{W}\) component. In other words, the DNN also performs a special kind of representational learning. This process is called Deep Fisher Discriminant Analysis or DDA in short.

To avoid the singularity problem, a regularization term can be added to the fitness, and the simplest form can be expressed as follows:

$${\mathrm{min}}_{w,W} \frac{1}{2}{\left|Y-f\left(X,{w}_{0},w,W\right)\right|}^{2}+\frac{\lambda }{2}\mathrm{trace}\left({w}^{t}w+{\tilde{W}}^{t}\tilde{W}\right)$$
(8)

where \(\tilde{W}\) is the components of \(w\), not including the bias of each hidden layer. It is important to note that the \(\frac{\lambda }{2}\mathrm{trace}\left({w}^{t}w\right)\) entry should be retained in all cases.

3.2 Gaussian process classification

The Gaussian process (GP) is a stochastic process in the probability theory and mathematical statistics (Rasmussen, 2003; Williams & Rasmussen, 2006), and the Gaussian process model in machine learning is purposed to place a Gaussian process as a prior over function with kernels and create a posterior distribution over function with datasets.

In the context of binary Gaussian process classification (GPC), a GP prior should be placed over the latent function \(f(x)\). Given an input \({\varvec{x}}\), the target values \({y}_{i}\in \left\{\mathrm{0,1}\right\}\) follows a Bernoulli distribution; to make the GPC model discriminative, the outputs from a Gaussian process is squashed through a logistic link function, so as to obtain the probabilistic classification. One commonly used link function is the logistic sigmoid function \(\sigma ,\) and the probability function can be estimated by:

$$p(y=1\mid {\varvec{X}}={\varvec{x}})=\frac{1}{1+\mathrm{exp}(-f({\varvec{x}}))}=\sigma (f)$$
(9)

For any observed data \(\mathbf{x}=\left({{\varvec{x}}}_{1},\cdots ,{{\varvec{x}}}_{n}\right)\) with binary labels \({y}_{i}\in \left\{\mathrm{0,1}\right\}\), Gaussian process over latent function \(f({\varvec{x}})\) can be defined by a mean function \(m(\mathbf{x})\) and a covariance function \(k({\varvec{x}},{{\varvec{x}}}^{\boldsymbol{^{\prime}}})\) as follows:

$$f({\varvec{x}})\sim \mathcal{G}\mathcal{P}\left(m({\varvec{x}}),K\left({\varvec{x}},{{\varvec{x}}}^{^{\prime}}\right)\right)$$
(10)

where \(K\left({\varvec{x}},{{\varvec{x}}}^{\mathrm{^{\prime}}}\right)\) is the covariance matrix with entries \({K}_{i,j}=k({{\varvec{x}}}_{{\varvec{i}}},{{{\varvec{x}}}_{{\varvec{i}}}}^{\boldsymbol{^{\prime}}})\).


According to Bayes’ theorem, the posterior on the latent variable is given by:

$$p(f\mid {\varvec{x}},y)=\frac{p(y\mid f)p(f\mid {\varvec{x}})}{p(y\mid {\varvec{x}})}$$
(11)

where \(p(y\mid {\varvec{x}})\) is independent with respect to \(f\). Then, the distribution of the link function corresponding to the test case is calculated:.

$$p\left({f}_{*}\mid {\varvec{x}},y,{{\varvec{x}}}_{*}\right)=\int p\left({f}_{*}\mid {\varvec{x}},{{\varvec{x}}}_{*},f\right)p(f\mid {\varvec{x}},y)\mathrm{d}f$$
(12)

This distribution over the latent variables is used to derive predictions for the test cases \({{\varvec{x}}}_{*}\):

$${\pi }_{*}=p\left({y}_{*}=+1\mid {\varvec{x}},y,{{\varvec{x}}}_{*}\right)=\int \sigma \left({f}_{*}\right)=\int \sigma \left({f}_{*}\right)p\left({f}_{*}\mid {\varvec{x}},y,{{\varvec{x}}}_{*}\right)\mathrm{d}{f}_{*}$$
(13)

As a result of the non-Gaussian nature of the posterior, Eq. 13 is difficult to calculate in integral terms. The numerical integration can be approximated by using Laplace’s method, which applies a Gaussian approximation \(\mathcal{N}(f\mid {\varvec{x}},y)\) to the posterior \(p(f\mid {\varvec{x}},y)\). By performing a second-order Taylor expansion of log \(p(f\mid {\varvec{x}},y)\) around the maximum of the posterior, this study obtains a Gaussian approximation:

$$\mathcal{N}(f\mid {\varvec{x}},y)=\mathcal{N}\left(f\mid \hat{f},{A}^{-1}\right)\propto \mathrm{exp}\left(-\frac{1}{2}(f-\hat{f}{)}^{\mathrm{T}}A(f-\hat{f})\right)$$
(14)

where

$$\hat{f}={\mathrm{argmax}}_{f} p(f\mid {\varvec{x}},y)$$
(15)

and

$$A=-{\left.\nabla \nabla \mathrm{log}p(f\mid {\varvec{x}},y)\right|}_{f=\hat{f}}$$
(16)

The latter is the Hessian of the negative log posterior at that point. After an iterative process of defining \(\hat{f}\) and \(A\) and calculating the mean and variance of \({f}_{*}\), the prediction can be expressed as:

$${\pi }_{*}\approx \int \sigma \left({f}_{*}\right)\mathcal{N}\left({f}_{*}\mid {\varvec{x}},y,{{\varvec{x}}}_{*}\right)\mathrm{d}{f}_{*}$$
(17)

3.3 Hybrid DDA-GP model with cross-validation

To enhance the model performance in terms of features, the hybrid DDA-GP model is integrated by extracting hierarchical features from the original feature space, while using DDA and feeding this enriched data into the Gaussian process model. The hybrid DDA-GP model is then evaluated through a repeated stratified K-fold cross-validation. The overall pipeline for the machine learning model with K-fold cross-validation method is shown in Fig. 10.

Fig. 10
figure 10

The overall procedure for soil classification assessment with machine learning

In the stratified cross-validation strategy, the training data is split into K different folds by using stratified sampling, so that each fold contains approximately the same percentage of samples for each target class as a complete set. One fold is taken as the test set, while the rest are used for training and validation. The error over the test fold is calculated. This process will be repeated for K iterations, so that each fold is used for testing once. But this is still not fair enough for comparison, since one sampling of K-fold cross-validation may result in a noisy biased estimate of model performance. However, repeated stratified K-fold cross-validation will lower the bias in model prediction. The overall performance evaluation is then regarded as the mean performance in every case.

In this way, the hybrid DDA-GP method is compared with other algorithms, such as supported vector machine, ensemble learning models (including Random forest and Extra Trees), Gaussian Process Boosting (Sigrist, 2004), LightGBM (Ke et al., 2017) and CatBoost (Dorogush et al., 2018), through the repeated stratified K-fold cross-validation. To assess the performance of machine learning methods in soil liquefaction prediction, a confusion matrix and receiver operating characteristic (ROC) curves are visualized. It should be noted that for the hybrid DDA-GP method during cross validation, the feature deep extraction DDA should be always trained on the training folds after splitting, rather than on the whole training dataset before cross-validation, so as to prevent the creation of a biased model.

4 Model evaluation and comparison

As introduced in the previous section, repeated cross-validation is used for model evaluation, and the number of folds is set to 10, with repetition of 20 times. The performance of the model is evaluated by calculating some metrics, including the classification accuracy (CAR), which is the percentage of correctly classified cases. Besides CAR, the classification performance is measured by precision, recall and \({F}_{1}\) score. The formulas for the above four metrics are as follows (Hackeling, 2017):

$$\mathrm{CAR}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}$$
(18)
$$\mathrm{Precision}= \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}};\mathrm{ Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}};{F}_{1}=2\frac{\text{ Precision }\cdot \text{ Recall }}{\text{ Precision }+ \, {\text{Recall}}}$$
(19)

where T, F, P and N represent true, false, positive and negative samples, respectively. For example, TP indicates the number of positive samples that are correctly classified; FN indicates the number of false-negative samples that are misclassified as marked in the confusion matrix in Fig. 10.

Upon evaluation, the averaged confusion matrix, Boxplot, and the receiver operating characteristic (ROC) curves using repeated-cross validation on shear wave velocity (Vs) Dataset 1 are first illustrated in Fig. 11; and the predicted DDA-GP results are also compared with those ensemble learning methods and SVM. The averaged confusion matrix in Fig. 11(a) is the averaged results for stratified tenfold cross-validation. It is obvious that only one sample of non-liquefaction state and one sample of liquefaction state are misclassified. The Receiver Operating Characteristic (ROC) curve is shown in Fig. 11(c), and it evaluates the classifier output quality by using stratified tenfold cross-validation. The true-positive rate is illustrated on Y axis, and the false-positive rate is on X axis. This means that the top left corner of the plot contains the “ideal” points. The closer the curve is to the left-top corner, the more accurate the test results. ROC curve score is computed by the area under the curve. The curve for each fold and its mean ROC is included for the positive class. It can be observed that the area under the mean ROC curve is 0.95. The Boxplot in Fig. 11(b) demonstrates the accuracy in the tenfold cross-validation repeated for 20 times. On each box, the central mark indicates the median, the edges of the box indicate the 25th and 75th percentiles, and the dots indicate the outliers. For Dataset 1, the median accuracy of DDA-GP is much higher than that of the rest ML algorithms. The difference between the lower and upper quartiles in Fig. 11(b) is smaller than in other figures. Table 1 illustrates the averaged precision, recall, \({F}_{1}\)-score and accuracy for the repeated Stratified tenfold cross-validation evaluation of classifiers. It can be found that DDA-GP demonstrates the best performance on all metrics.

Fig. 11
figure 11

(a) Averaged confusion matrix, (b) boxplot, and (c) receiver operating characteristic (ROC) curves using repeated cross-validation on Dataset 1

Table 1 Average precision, Recall, \({F}_{1}\) and CAR (%) by repeated cross-validation on Dataset 1

Furthermore, the results from the model evaluation on CPT-based liquefaction dataset are shown in Fig. 12 and Table 2. It can be seen that except one sample of Nonliquefaction state, DDA-GP correctly classifies all other liquefaction states in the averaged confusion matrix of the tenfold cross-validation, as shown in Fig. 12(a). From the ROC curves shown in Fig. 12(c), it can be observed that the mean “Area under the Curve” (AUC) reaches 0.98; and during the tenfold cross-validation, AUC frequently reaches 1. In Fig. 12(b), the DDA-GP and GP models perform best in the repeated stratified tenfold cross-validation, which largely outperforms the rest machine learning models in comparison. In Table 2, GP yields the best averaged CAR in all repeated sub-samplings, and DDA-GP is in parallel with GP and obtains superior accuracy over the rest classifiers. Both LightGBM and GPBoost fail in dealing with CPT-based liquefaction datasets.

Fig. 12
figure 12

(a) Averaged confusion matrix, (b) Boxplot, and (c) Receiver Operating Characteristic (ROC) curves using repeated cross-validation on Dataset 2

Table 2 Average precision, Recall, \({F}_{1}\) and CAR (%) by repeated cross-validation on Dataset 2

This study continues to investigate the model performance on Dataset by using SPT-based measurements. From the averaged confusion matrix of the averaged tenfold cross-validation shown in Fig. 13(a), it can be found that all non-liquefaction and liquefaction states are correctly classified. From the ROC curves in Fig. 13(c), the mean “Area under the Curve” (AUC) reaches 0.98; and during the tenfold cross-validation, AUC even reaches 1, which indicates that the proposed classifiers can distinguish liquefaction and non-liquefaction classes. From boxplots in Fig. 13(b), it can be seen that DDA-GP, GP, SVM, RF and ET all show excellent performance in repeated stratified tenfold cross-validation. More details on the evaluation metrics are offered in Table 3, which shows that the averaged precision, recall, \({F}_{1}\)-score and accuracy for the five classifiers are quite high, all reaching 94% CAR.

Fig. 13
figure 13

(a) Averaged confusion matrix, (b) boxplot, and (c) receiver operating characteristic (ROC) curves using repeated cross-validation on Dataset 3

Table 3 Average precision, Recall, \({F}_{1}\) and CAR by repeated cross-validation for Dataset 3

Next, this study will evaluate the model performance on a challenge dataset collected from real-world earthquake cases with a larger data size. From the obtained averaged confusion matrix of the tenfold cross-validation shown in Fig. 14(a), it can be seen that four non-liquefaction states and four liquefaction states are all misclassified. From the ROC curve in Fig. 14(c), it can be observed that the mean “Area under the Curve” (AUC) reaches 0.94 and the proposed classifiers can distinguish most of the liquefaction and non-liquefaction classes. From the boxplot in Fig. 14(b), it can be found that DDA-GP shows the best performance, with much higher values than the rest classifiers in repeated stratified tenfold cross-validation. For all evaluation metrics shown in Table 4, DDA-GP delivers the best averaged CAR, reaching 87% in repeated subsampling. For the real-case datasets, the performance of other machine learning models deteriorates exceedingly, while DDA-GP still maintains high accuracy.

Fig. 14
figure 14

(a) Averaged confusion matrix, (b) boxplot, and (c) receiver operating characteristic (ROC) curves using repeated cross-validation on Dataset 4

Table 4 Average precision, Recall, \({F}_{1}\) and CAR by repeated cross-validation for Dataset 4

Dataset 5 has been scarcely investigated by the existing researches; it comprises liquefaction case histories of gravelly soil. The performance of all machine learning algorithms is evaluated. For the obtained averaged confusion matrix of the tenfold cross-validation shown in Fig. 15(a), only one case of non-liquefaction and one case of liquefaction are incorrectly classified. The mean “Area under the Curve” (AUC) is 0.93 in the ROC curve in Fig. 15(c). In the boxplot Fig. 15(b) on the accuracy of the repeated stratified tenfold cross-validation, classifiers of both Random forest and Extra tree show better performance than the rest classifiers. DDA-GP ranks the third in the median values, but with fewer outliers. More details are shown in Table 5, which shows that the precision of DDA-GP matches well with RF and ET, and the average CAR is also pretty high in repeated cross-validation.

Fig. 15
figure 15

(a) Averaged confusion matrix, (b) boxplot, and (c) receiver operating characteristic (ROC) curves using repeated cross-validation on Dataset 5

Table 5 Average precision, Recall, \({F}_{1}\) and CAR by repeated cross-validation for Dataset 5

Further, the Wilcoxon signed-rank test is supplemented to check whether the experimental results are statistically inaccurate. The test ranks the performance differences between the two classifiers for each data set by ignoring the signs, and then compares the ranking of positive and negative differences (Demšar, 2006). In this study, the significance level for Wilcoxon signed-rank test is set to 0.05. The p-values of the Wilcoxon signed rank test on the five datasets are calculated. If the p-value of the test is below the critical value of 0.05, the two models will be regarded as having significantly statistical differences.

Tables 6, 7, 8, 9, 10 show the comparison of the models based on the Wilcoxon signed rank test. In the tables, symbols “+  +”, “+”, “*”, “−” and “− −” represent a significant win, a win, a tie, a loss and a significant loss, respectively. A significant win implies that at least \(\frac{N}{2}+1.96\frac{\sqrt{N}}{2}\) wins in pairwise comparisons, where N is the total number of tests. It be concluded from the tables that DDA-GP has statistical advantages in all cases, except the SPT-based dataset and the gravelly soil dataset. Even for the SPT-based dataset, its statistical performance reaches a tie. For the gravelly soil dataset, DDA-GP still shows statistical advantages over most of the machine learning models, as indicated in the comparison. For the shear wave velocity dataset and the real-world earthquake dataset, GPBoost loses to all other classifiers. For CPT-based dataset, SPT-based dataset and real historic dataset of gravelly soil, LightGBM loses to all other classifiers.

Table 6 Comparison outcomes obtained from the Wilcoxon signed rank test on Dataset 1
Table 7 Comparison outcomes obtained from the Wilcoxon signed rank test on Dataset 2
Table 8 Comparison outcomes obtained from the Wilcoxon signed rank test on Dataset 3
Table 9 Comparison outcomes obtained from the Wilcoxon signed rank test on Dataset 4
Table 10 Comparison outcomes obtained from the Wilcoxon signed rank test on Dataset 5

5 Transfer learning for multiple data sources

In the above model evaluation, although DDA-GP shows satisfactory performance on all soil liquefaction datasets, the machine learning models have been tuned individually and trained from the scratch, as shown on the left side of Fig. 16. In this case, machine learning algorithms are designed for specific data and reach a high level of accuracy only in specific data ranges. This may costly and generally require a vast batch of data, and the model’s generality is invariably limited. However, all data sources in the prediction of soil liquefaction are collected from multiple source domains, which share the same task and target. It is significant to investigate and bridge the inner relationship between different data sources. Transfer learning for multiple data sources provides an effective means to solve these problems (Ding et al., 2016). A comparison between the conventional machine learning and the transfer learning is shown in Fig. 16. The multiple data sources’ transfer learning systems would transfer knowledge from multiple data sources, so as to learn the tasks in the target domains. One important prerequisite for such transfer learning is that there are correlations between the target and the source domains. Thus, the knowledge extracted from the data in the source domains will be closely related to the target domain, making it possible to create a robust predictive model for the targeted data sources. This is just the case for not only soil liquefaction prediction, but for other geotechnical engineering prediction problems, such as rock burst prediction. Different sources of data can be collected from different laboratory tests, in-situ tests, etc. They generally share several input features. In this investigation, the examples include the physical properties of soils and the mechanical and physical quantities from seismic conditions; more importantly, they share the same learning tasks. With advancements of Internet of Things (IoT), it will be more convenient for civil engineers to collect data for certain prediction tasks in engineering; however, no buildings for multiple data sources’ transfer learning models in the field of geotechnical engineering have been studied. This is the primary task of this investigation.

Fig. 16
figure 16

A comparison between conventional machine learning and transfer learning approaches

As shown in Fig. 16, a given source domain is denoted as \({\mathcal{D}}_{\mathrm{s}}\) its corresponding task is expressed as \({\mathcal{T}}_{\mathrm{s}}\), and the learnt function that interprets the knowledge is defined as \({\mathcal{F}}_{\mathrm{s}}\). The next goal is to get the target predictive mapping \({\mathcal{F}}_{\mathrm{t}}\) for target data \({\mathcal{D}}_{\mathrm{t}}\) and task \({\mathcal{T}}_{\mathrm{t}}\). Transfer learning can help improve the performance of \({\mathcal{F}}_{\mathrm{t}}\) with the knowledge inherited from \({\mathcal{F}}_{\mathrm{s}}\), where \({\mathcal{D}}_{\mathrm{s}}{\ne \mathcal{D}}_{\mathrm{t}}\) or \({\mathcal{T}}_{\mathrm{s}}\ne {\mathcal{T}}_{\mathrm{t}}\). Because of different data sources for soil liquefaction, they generally have different source domains, not only in the number of features, but also in feature magnitudes with the same tasks, namely \({\mathcal{D}}_{\mathrm{s}}{\ne \mathcal{D}}_{\mathrm{t}}\) but \({\mathcal{T}}_{\mathrm{s}}={\mathcal{T}}_{\mathrm{t}}\). Domain adaptation technique is usually used to transfer knowledge from a source to a target in this circumstance (Wen et al., 2020). Domain adaption aims at finding the mapping, so as to reduce the domain divergence between \({\mathcal{D}}_{\mathrm{s}}\) and \({\mathcal{D}}_{\mathrm{t}}\). The domain divergence extracted by domain adaption can help improve the model performance of \(\mathrm{t}\), which will significantly reduce the need for labeled data. This study tries mainly on the data reconstruction approaches that utilize a deep Autoencoder and a boosting-based method with Gaussian process classifiers as the base model for multi-source transfer learning, which combines the advantages from the feature-based and instance-based domain adaption models.

5.1 Deep Autoencoder for data reconstruction

As shown in Fig. 17, the deep Autoencoder is deployed for data reconstruction between the data source domain \({\mathcal{D}}_{\mathrm{s}}\) and the target domain \({\mathcal{D}}_{\mathrm{t}}\).

Fig. 17
figure 17

A deep Autoencoder for data reconstruction

Autoencoders are a specific type of feedforward neural networks, which include an encoder and a decoder. The network reconstructs the inputs by mapping them from a high-dimensional space to a low-dimensional space, thus enabling the hidden layer to learn a better representation of the inputs; and then, the decoder layer reconstructs the results to another space.

As shown in Fig. 17, the fully connected feedforward neural network is composed of multiple layers: an input layer, an encoding layer, an “bottleneck” layer, a decoding layer, and an output layer. Two parts in different shadows are shown to illustrate the encoder and the decoder. The encoder can be represented as mapping through an activation function:

$$f:{\varvec{h}}=f({\varvec{x}})$$
(20)

Likewise, a decoding denotes a mapping:

$$g:{\varvec{r}}=f({\varvec{h}})$$
(21)

which produces a reconstruction. The Autoencoder defines a compound mapping:

$$AE:g\circ f:{\varvec{r}}=g(f({\varvec{h}}))$$
(22)

Let \(\sigma\) be the nonlinear activation function on each layer. Based on weight \({{\varvec{\omega}}}_{1}\) and bias vectors \({{\varvec{b}}}_{1}\) of the encoder, the nonlinear mapping can be expressed as:

$$f:{\varvec{h}}=\sigma ({{\varvec{\omega}}}_{1}{\varvec{x}}+{{\varvec{b}}}_{1})$$
(23)

where \({\varvec{x}}\) indicates the input data. The encoder maps a high-dimensional low-level input vector \({\varvec{x}}\) into a low-dimensional higher-level latent vector \({\varvec{h}}\), which is assumed to nicely encode properties and attribute of \({\varvec{x}}\). Similarly, the decoder map can be expressed as:

$$g:{\varvec{r}}=\delta \left({{\varvec{\omega}}}_{2}{\varvec{h}}+{{\varvec{b}}}_{2}\right)$$
(24)

where \({\varvec{h}}\) indicates the outputs from the encoder and \({\varvec{r}}\) indicates the outputs from the decoder. The deep Autoencoder can be trained with the back-propagation algorithm. Once it is well trained, the compressed data with lower dimensionality can be generated by the encoder.

For different source and target domains, deep Autoencoder helps reconstruct the data expression and reduce the divergence between the source and target domains, so as to keep the favourable properties and attributes of original data sources. A preliminary research by this study in correcting the difference between the source and target domains is conducted mainly from two aspects, as shown in Fig. 18. One aspect is to use deep Autoencoder to directly compress the dataset with higher dimensionality \({\varvec{x}}\) into a domain with low dimensionality \({\varvec{y}}\), which is generally called the encoded feature space; and the compressed data are transformed to a desirable distribution in consistent with the data domain with low dimensionality by using z-transform (Kanasewich, 1981).

Fig. 18
figure 18

Feature-based data reconstruction with Autoencoder

$${y}_{i}=\frac{{x}_{i}-{\mu }_{x}}{{\sigma }_{x}}{\sigma }_{y}+{\mu }_{y}$$
(25)

where \(\mu\) and \(\sigma\) refer to the mean and the standard deviation values, respectively. Another aspect is initiated by compressing the data with higher dimensionality into a divergence feature space, and then the extracted divergence is adopted as extra features, which will be added to the data with lower dimensionality. In this way, the divergence knowledge is directly embedded into the data, with fewer chances of losing valuable information.

5.2 Transfer AdaBoost for learning systems

With the feature reconstruction model introduced in the former section, the source and target domains are transferred to two deeply correlated feature spaces with the same feature size. Both in the source and target domains, there may be labelled instances that might be noisy, thus influencing the model performance. To leverage the labelled instances both in the source and target domains, the Transfer AdaBoost algorithm (Dai, 2007) is further deployed for classification, so as to reduce the influence of negative transfers. In the investigation described in Sect. 4, the Gaussian process has been verified to be a robust model for soil liquefaction, so it will serve as the base model in the Transfer AdaBoost algorithm. The Transfer AdaBoost algorithm treats the source data \({\mathcal{D}}_{\mathrm{s}}\) and target data \({\mathcal{D}}_{\mathrm{t}}\) differently. The label samples in \({\mathcal{D}}_{\mathrm{s}}\) and a small number of labelled samples from \({\mathcal{D}}_{t}\) are adopted as the training datasets. By continuously adjusting the weights of the samples, the data in \({\mathcal{D}}_{\mathrm{s}}\) that are not suitable for \({\mathcal{D}}_{\mathrm{t}}\) can be filtered out. If the labelled data in \({\mathcal{D}}_{\mathrm{t}}\) are incorrectly identified, the weights of the samples will increase in training. The weight updating mechanism (Dai, 2007) for any instance \(\left({x}_{i},{y}_{i}\right)\in {\mathcal{D}}_{\mathrm{s}}\cup {\mathcal{D}}_{t}\) can be deployed by applying a fixed multiplier:

$${w}_{k+1}\left({x}_{i}\right)=\left\{\begin{array}{ll}{w}_{k}^{\mathrm{s}}\left({x}_{i}\right){\beta }^{\left[{h}_{k}\left({x}_{i}\right){y}_{i}\right]}& \text{ if} \, \left({x}_{i},{y}_{i}\right)\in {\mathcal{D}}_{\mathrm{s}}\\ {w}_{k}^{\mathrm{t}}\left({x}_{i}\right){\beta }_{k}^{-\left[{h}_{k}\left({\pi }_{i}\right)\ne {y}_{i}\right\}}& \, {\text{if}} \, \left({x}_{i},{y}_{i}\right)\in {\mathcal{D}}_{\mathrm{t}}\end{array}\right.$$
(26)

where \({\beta }_{k}\) is defined with respect to the errors of prediction for the training samples from target domain \({\mathcal{D}}_{t}\); \({h}_{k}\) refers to the predicted label, and \({y}_{i}\) the true label. Then, a strong classifier can be formulated as: \({h}_{f}(x)={\mathrm{argmax}}_{y} {\sum }_{k=[T/2]}^{T} I\left[{h}_{k}(x)=y\right]\mathrm{log}\left(1/{\beta }_{k}\right)\).

5.3 Results of model evaluation

For model evaluation, the two strategies introduced in Sect. 5.1 about data reconstruction based on different data sources will be compared. In addition, the instance-based Transfer AdaBoost (TrAdaBoost) method will be compared with Kullback–Leibler Importance Estimation Procedure (KLIEP) and Kernel Mean Matching (KMM) methods (Mathelin et al., 2021; Miao et al., 2015; Sugiyama et al., 2008). Dataset 4, due to its abundant samples, is selected as the source data. For deep Autoencoder, the hyperparameters selected are listed in Table 11.

Table 11 Hyperparameter settings for deep Autoencoder

The results of the three instance-based transfer learning methods with the first strategy (denoted by SS1) are compared and evaluated with convergence loss graphs and confusion matrix, with the classification metrics including precision, recall, \({F}_{1}\)-score and CAR. From all the convergence graphs shown in Fig. 19, it can be found that all the loss graphs would converge to a small value around 100 epochs and then reach a plateau. As observed in Fig. 19, loss vs. epoch graphs on both training and validation datasets would decrease to the plateau with a similar small value, and there is no sign of overfitting. From the confusion matrix with Transfer Adaboost shown in Fig. 20, it can be seen that the performance of SS1-TrAdaboost demonstrates higher performance for Datasets 2 and 3 with more data samples for testing as unseen datasets. In details, the evaluation metrics about precision, recall, \({F}_{1}\) and CAR are shown in Table 12, where the symbol “-” represents the worst performance, and some indices are Nan values. Only TrAdaboost yields stable and accurate results, compared with the other two transfer learning algorithms. Also, for Datasets 2 and 3, TrAdaboost shows desirable performance in all the four metrics.

Fig. 19
figure 19

Loss vs. epoch on the training and validation sets for deep Autoencoder with different Datasets by using SS1

Fig. 20
figure 20

Confusion Matrix for Transfer Adaboost algorithm with different Datasets by using SS1

Table 12 Precision, Recall, \({F}_{1}\) and CAR obtained by multiple transfer learning schemes using SS1

For the second strategy (denoted by SS2), extra features are supplemented to reduce the difference between the source and target data, and the loss vs. epoch graphs are shown in Fig. 21. As seen in the convergence graphs, both training and validation loss curves decrease to a small value and later turn to a flat stage after several epochs, and there is no evidence of overfitting for the deep Autoencoder. The confusion matrix yielded by TrAdaboost as listed in Fig. 22 also showcases better performance for Datasets 2 and 3. As shown in Table 13, for the precision, recall, \({F}_{1}\) and CAR, there are no Nan values for KMM and KLIEP, although the performance of KMM and KLIEP is much worse than TrAdaboost.

Fig. 21
figure 21

Loss vs. epoch on the training and validation sets for deep Autoencoder with different Datasets using SS2

Fig. 22
figure 22

Confusion Matrix for Transfer Adaboost algorithm with different Datasets using SS2

Table 13 Precision, Recall, \({F}_{1}\) and CAR obtained by multiple transfer learning schemes using SS2

From the above observations, it can be concluded that with both SS1 and SS2, TrAdaboost yields more stable and accurate predictive results than do KMM and KLIEP. However, for SS1, in the instance-based transfer learning algorithms of KMM and KLIEP, there are cases that prediction labels would bias towards one class, a phenomenon never observed in SS2. In addition, SS2 obtains results slightly better than does SS1 for Datasets 2 and 3. However, one notable observation here is that the performance of TrAdaboost on Datasets 2 and 3 is much better than on other datasets, reaching a level agreeable and comparable with the singly tuned machine learning models. This can be ascribed to the higher correlation between Dataset 4 and Datasets 2 and 3. This matches well the actual observations as described in Sect. 2.1, which shows that there are four features in Datasets 2 and 3 which match well those in Dataset 4.

6 Sensitivity analysis

In this section, the input features will be analysed based on the proposed DDA-GP model, purposed to find out the most influential features in the model evaluation. The Sobol global sensitivity analysis is employed in this research. As a kind of variance-based sensitivity analysis (Sobol, 2001), Sobol's method explores how the uncertainty in the output of a model can be divided into different sources of uncertainty in its inputs.

In this study, the first-order sensitivity index (FOSI) S1 as described in Fig. 23 is employed to unveil the effects of input features on the results by using Sobol's sensitivity analysis. For results on Dataset 1 as shown in Fig. 23(a), the depth of soil D delivers the highest S1 of 40.3%. Moment magnitude \({M}_{\mathrm{w}}\) delivers a minimum FOSI of 1.6%. From Fig. 23(b), it can be concluded that, for Dataset 2, the maximum horizontal ground surface acceleration \({a}_{\mathrm{max}}\) and the total stress \({\sigma }_{v}\) have relatively high S1, i.e., 61.6% and 26%, respectively. The analysis of Dataset 3 shows that the highest S1 is obtained by SPT blow numbers \({N}_{60}^{1}\), with a value of 37.4%. As can be seen in Fig. 23(d), both the total and effective stresses have relatively high S1 of 25.0% and 18.6%, respectively; also F75 has a high S1 value of 19.2%. For Dataset 5, moment magnitude \({M}_{\mathrm{w}}\) plays the most important role with 45.3% FOSI, and \({D}_{50}\) has a minimum FOSI of 1.3%.

Fig. 23
figure 23

Sensitivity analysis of DDA-GP under different datasets

The effects of the shared features between different datasets on the model’s accuracy are further investigated, especially for the depth of soils and the effective and total stresses; the conclusions, however, vary differently. For Dataset 1, the depth of soil layer greatly impacts the model’s accuracy; and both the total and effective overburden stresses also influence the accuracy, but to a smaller extend. For CPT-based dataset, the total stress will influence the model’s accuracy, while the effective stress can be negligible. For SPT-based dataset, the depth of soil has a smaller influence on the model’s accuracy; and both the effective and total effective overburden stresses have larger impacts on the model’s accuracy. As to real historic earthquake cases, the depth of soil has a small influence on the model’s performance, while both the effective and total effective overburden stresses have big influence on the model’s accuracy. The observation of the effects of the three common parameters in SPT-based dataset agrees with that of the historic cases. The effective stress moderately influences the model’s performance for Dataset 5.

7 Conclusion

This study introduces a hybrid hierarchal machine learning model by combining the deep Fisher discriminant analysis (DDA) with the Gaussian process. The DDA-GP model merges the generated deep representative feature sets into original input datasets to enhance learning capacities. To demonstrate the generality of this model, comprehensive experiments on multiple datasets covering several in-situ tests and real cases are performed by using repeated stratified tenfold cross-validation. More specifically, a dataset for a historic liquefaction case of gravelly soil with missing values is preprocessed and explored. Further, the presented machine learning model is compared with several widely used machine learning models, including the state-of-the-art ensemble learning, so as to render a more robust machine learning model regarding soil liquefaction. From evaluation results, it can be clearly observed that the proposed DDA-GP model yields favourable results for all datasets and the Wilcoxon signed-rank test. For other traditional machine learning models, it may be suitable to deal with one certain type of datasets, but fails for other datasets.

Next, this study explores the domain adaption method by integrating a feature-based deep Autoencoder for data reconstruction and a boosting-based transfer learning algorithm in domain adaption of multi-data sources. This method presents a more generalised Gaussian process model for dealing with civil engineering problems with miscellaneous data sources. The domain adaption model is compared based on different feature encoding strategies and with other instance-based transfer learning schemes, finding that by embedding the divergence between the source and target domains into feature spaces, more stable and accurate results will be generated. Furthermore, for datasets with correlations, the domain adaption method can reach an accuracy level comparable to that of signally tuned machine learning models. Finally, a sensitivity analysis is performed to find out the influencing variables for DDA-GP model on all of the datasets. However, as a preliminary research in multi-source transfer learning in the field of civil engineering, this study needs to build a more robust and generalized domain adaption model by using conditional (target-oriented) adversarial deep Autoencoder for mapping between different and complicated feature spaces.