1 Introduction

With rapid economic development, urban communication has become increasingly frequent. In recent years, numerous intercity railway and metro systems have been constructed [36]. Earth pressure balance (EPB) shield machines are widely used in constructing underground urban tunnels. Tunnel excavations are complex construction projects involving various risks due to uncertain underground conditions and tunnelling processes [14, 16, 21, 38]. In the excavation process, one of the main aims is to optimize the performance of the tunnelling system and reduce risks during shield tunnelling [3, 47, 48, 53]. Many factors influence shield performance, cost, and project progress [15, 17, 42]. Geological characteristics (GC) are key uncertainty factors that significantly influence schedule performance [5, 35, 42, 43]. These characteristics can be estimated before shield excavation from the geological profile, which is based on borehole information. However, the geological profile cannot accurately reveal the actual geological features of every lining ring. Uncertainty in geological features may result in low shield efficiency, high cost, and project delays. To ensure the successful completion of a tunnel project, the construction project management team is encouraged to refine measurements to determine the geological conditions before shield tunnelling. With a clustering model built on selected principal factors, these geological characteristics can be determined automatically. This can assist shield operators in setting shield parameters and reducing construction risk during shield tunnelling.

The geological profile consists of various geomaterial zones owing to complex geological processes [20]. Prior to project construction, geological information can be obtained through geological surveys [4]. The geological formations in the region between two boreholes (the cross-borehole region) are inferred by linearly connecting the soil or rock layers observed in each borehole [13]. Inferring geological formations in this way is a labour-intensive process with only moderate reliability [13]. Improving the reliability of the inference can lead to more efficient operational planning and better geological assessment. Researchers have proposed many methods, e.g. neural networks, random field models, and geo-genetic maps, to investigate cross-borehole soil-geological interpolation [20, 26]. For shield tunnelling, the shield parameters are related to geological characteristics and can therefore be employed to reveal variations in geological conditions [32]. Unexpected geological conditions in front of the tunnel face can significantly affect the safety of shield excavation [11, 22, 34, 51]. Thus, several assessment models have been established to predict unexpected geological conditions [1, 2, 9]. Recently, with the development of artificial intelligence technology, researchers have proposed many prediction models based on expert systems, machine learning, and deep learning to determine GC ahead of the shield cutterhead and improve the efficiency of shield tunnelling [6, 24, 25]. However, the existing models are mostly supervised models trained on data that become available only after shield construction is complete [49, 50]. There is still a strong demand for unsupervised models that identify, in real time and without human influence, the geological features of each lining ring along a shield tunnel [16, 19, 29]. Therefore, it is necessary to build an unsupervised machine learning model that independently identifies the unknown geological features ahead of the shield cutterhead, rather than relying on known geological features.

The objective of this study is to present the relationship between shield parameters and GC as well as to develop an unsupervised identification model using a fuzzy clustering algorithm. In this study, the identified factors comprise six shield parameters: advance rate (AR), thrust (F), cutterhead torque (T), cutterhead rotation speed (CRS), upper earth pressure (UEP), and lower earth pressure (LEP). They also include four transformed factors: penetration rate (PR), field penetration index (FPI), torque penetration index (TPI), and specific energy (SE). The dimension of the identified factors was reduced to six principal components through principal component analysis (PCA). A clustering model for GC was developed by the fuzzy C-means algorithm based on the dimensionally reduced data. The fuzzy partition coefficient and silhouette coefficient were used to determine the number of GC types. Sensitivity analysis was employed to evaluate the influence of individual factors on the output. The Guangzhou–Foshan intercity railway project in China was employed as a case study to demonstrate the feasibility and capability of the developed clustering model.

2 Methodology

2.1 Framework of identifying geological characteristics

The framework of identifying GC (see Fig. 1) consists of three phases: data processing, fuzzy C-means implementation, and geological characteristic identification. Data processing involves determining the identified factors of geological features. Because the shield parameters are closely related to geological conditions, they were collected via sensors in the shield machine and processed into a standardized matrix. Information on the geological profile and geological conditions recorded after shield tunnelling was also collected for reference. Then, a database based on the weights of principal components was created using PCA. In the fuzzy C-means phase, the number of GC types was determined via the elbow method, silhouette coefficient, and fuzzy partition coefficient. The fuzzy C-means method was developed by integrating the K-means++ algorithm (Sect. 2.3) with a fuzzy set and membership function. A clustering model was established based on the fuzzy C-means method. Finally, geological characteristics were identified using the proposed model. The identification results can provide early warning information that helps engineers arrange the construction schedule [53].

Fig. 1 Framework of identifying geological characteristics

2.2 Principal component analysis

A widely used method for reducing data dimensionality is PCA. It can be applied to extract the principal components of data without losing substantial information through an orthogonal transformation. The extracted k (k < m) linearly independent variables reflect the main characteristics of the original database and support data visualization [10, 26]. Therefore, the original matrix with m dimensions can be compressed to k dimensions to improve the computational efficiency of machine learning models. The PCA process has six main steps.

(1) The original database contained n pieces of m-dimensional data. The original database matrix can be constructed using Eq. (1):

$$X = [x_{ij}]_{n \times m} = \left[ \begin{array}{cccccc} & C_{1} & C_{2} & C_{3} & \cdots & C_{m} \\ X_{1} & x_{11} & x_{12} & x_{13} & \cdots & x_{1m} \\ X_{2} & x_{21} & x_{22} & x_{23} & \cdots & x_{2m} \\ X_{3} & x_{31} & x_{32} & x_{33} & \cdots & x_{3m} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ X_{n} & x_{n1} & x_{n2} & x_{n3} & \cdots & x_{nm} \\ \end{array} \right]$$
(1)

where Cm represents the mth factor (column), and Xn is the nth data record (row).

(2) The original data were standardized using Eq. (2):

$$X_{ij} = \frac{x_{ij} - \overline{x}_{j}}{S_{j}}$$
(2)

where \(X_{ij}\) is the standardized data, \(\overline{x}_{j}\) is the average value of the jth column data, and \(S_{j}\) is the standard deviation of the jth column data.

(3) Calculate the covariance of the standardized data and establish a covariance matrix. Covjk is the covariance between the standardized data of factors j and k and can be calculated using Eq. (3):

$$Cov_{jk} = \frac{1}{n - 1}\sum\limits_{i = 1}^{n} {\left( {X_{ij} - \overline{X}_{j} } \right)} \left( {X_{ik} - \overline{X}_{k} } \right)$$
(3)

where Xij is the standardized data of row i and column j; Xik is the standardized data of row i and column k; and \(\overline{X}_{j}\) and \(\overline{X}_{k}\) are the average values of the standardized data of factors j and k. The covariance matrix of the m-dimensional data can be constructed using Eq. (4):

$$Cov = \left[ {\begin{array}{*{20}c} {Cov_{11} } & {Cov_{12} } & {Cov_{13} } & \cdots & {Cov_{1m} } \\ {Cov_{21} } & {Cov_{22} } & {Cov_{23} } & \cdots & {Cov_{2m} } \\ {Cov_{31} } & {Cov_{32} } & {Cov_{33} } & \cdots & {Cov_{3m} } \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ {Cov_{m1} } & {Cov_{m2} } & {Cov_{m3} } & \cdots & {Cov_{mm} } \\ \end{array} } \right]$$
(4)

(4) Calculate the eigenvalues and corresponding eigenvectors of the covariance matrix.

(5) Arrange the eigenvectors as the columns of a matrix, from left to right in descending order of their corresponding eigenvalues, and then take the first k columns to form a matrix (P).

(6) The standardized data matrix X (n × m) can be transformed into k-dimensional data Y (n × k) using Eq. (5), where P is the m × k matrix from step (5):

$$Y = XP$$
(5)
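As an illustration only, the six steps above can be sketched in Python with NumPy. The function name `pca_reduce` and the synthetic data are assumptions introduced here and are not part of the original study; samples are stored as rows, so the projection is computed as the standardized data matrix times the eigenvector matrix.

```python
import numpy as np

def pca_reduce(x, k):
    """Project an (n, m) data matrix onto its first k principal components."""
    x = np.asarray(x, dtype=float)
    # Step 2: standardize each column (sample standard deviation, ddof=1).
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    # Step 3: covariance matrix of the standardized data (m x m).
    cov = np.cov(z, rowvar=False)
    # Step 4: eigen-decomposition; eigh is suited to symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 5: sort by descending eigenvalue and keep the first k eigenvectors.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    p = eigvecs[:, :k]
    # Step 6: project the standardized data onto the retained components.
    y = z @ p
    weights = eigvals / eigvals.sum()  # share of variance per component
    return y, weights

# Synthetic stand-in for a 10-factor database (hypothetical values).
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 10))
data[:, 5:] = data[:, :5] + 0.05 * rng.normal(size=(200, 5))  # correlated columns
scores, weights = pca_reduce(data, k=6)
print(scores.shape, np.cumsum(weights)[:6])
```

The returned `weights` are the eigenvalues normalized to sum to one, which is one common way to express the contribution of each principal component.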

2.3 Original K-means++ clustering algorithm

The K-means++ algorithm, an unsupervised clustering algorithm, is an improvement of the K-means (or C-means) algorithm. The selected initial cluster centres have a crucial influence on the results and running time of the K-means algorithm: if the cluster centres are selected purely at random, the K-means algorithm can converge extremely slowly. The K-means++ algorithm improves on K-means in the way the initial cluster centres are selected. Its working procedure is as follows.

Step 1: Input the database, D = {x1, x2, …, xm}, and number of clusters (K).

Step 2: Randomly select the first cluster centre (μ1) in the database.

Step 3: For each sample (xi), calculate the distance \(D(x_i) = \min_{r} \|x_i - \mu_r\|_2\) to the nearest cluster centre (μr) among the centres selected so far.

Step 4: Select the next cluster centre using the roulette method, in which each sample (xi) is selected with probability \(P(x_i) = D(x_i)^2 / \sum_{i} D(x_i)^2\).

Step 5: Repeat Steps 3 and 4 until K cluster centres, i.e. (μ1, μ2, …, μK), have been selected.

Step 6: Initialize K empty clusters, Ct = ∅ (t = 1, 2, …, K).

Step 7: Calculate the distance (dij = || xi-μj||2) between each sample (xi, i = 1,2, …, m) and every cluster centre (μj, j = 1,2, …, K). Then, assign the sample (xi) to the category (Cj) with the least distance, and update clusters Cj = Cj∪{xi}.

Step 8: Calculate a new cluster centre for each cluster. The cluster centre is calculated as follows:

$$\mu_{j} = \frac{1}{{N_{j} }}\sum\limits_{i = 1}^{{N_{j} }} {x_{i} }$$
(6)

where μj is the cluster centre of Cj, and Nj is the number of xi in Cj.

Step 9: Repeat Steps 7 and 8 until the cluster centres are stable.

Step 10: Output K clusters C = {C1, C2,..., Ck}.
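The ten steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names, the fixed random seed, the iteration cap, and the synthetic, deliberately well-separated blobs are all hypothetical.

```python
import numpy as np

def _sq_dist(x, centres):
    """Squared Euclidean distance from every sample to every centre, shape (n, k)."""
    return ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)

def kmeans_pp(x, k, seed=0, max_iter=100):
    """K-means with K-means++ seeding; returns (labels, centres)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    # Step 2: pick the first centre uniformly at random.
    centres = [x[rng.integers(n)]]
    # Steps 3-5: pick each further centre with probability proportional to D(x)^2.
    while len(centres) < k:
        d2 = _sq_dist(x, np.array(centres)).min(axis=1)
        centres.append(x[rng.choice(n, p=d2 / d2.sum())])
    centres = np.array(centres)
    # Steps 6-9: alternate assignment and centre update until the centres are stable.
    for _ in range(max_iter):
        labels = _sq_dist(x, centres).argmin(axis=1)
        new_centres = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    # Step 10: output the cluster assignment and the centres.
    return labels, centres

# Three synthetic, well-separated blobs (hypothetical data).
rng = np.random.default_rng(1)
blob_centres = np.array([[0.0, 0.0], [100.0, 100.0], [200.0, 0.0]])
points = np.vstack([c + rng.normal(size=(30, 2)) for c in blob_centres])
labels, centres = kmeans_pp(points, k=3)
print(np.bincount(labels))
```

An empty cluster keeps its previous centre in this sketch; other conventions, such as re-seeding the centre, are equally possible.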

The number of clusters (K) is a crucial parameter in the K-means++ algorithm. However, the clusters obtained by K-means++ may differ from the actual situation because K is usually selected manually according to experience. In this study, the number of clusters was selected based on the elbow method and silhouette coefficient. In the elbow method, the sum of square errors (SSE) is plotted against K; the curve may show several elbows, and K is chosen as the elbow with the largest silhouette coefficient. The SSE and Si values can be calculated as follows [31, 33]:

$$SSE = \sum\limits_{j = 1}^{K} \sum\limits_{i = 1}^{N_{j}} \left| x_{i} - \mu_{j} \right|^{2}$$
(7)
$$\begin{gathered} S_{i} = \frac{b_{i} - a_{i}}{\max \left\{ a_{i}, b_{i} \right\}} \quad (i = 1, 2, \ldots, N_{j}) \hfill \\ b_{i} = \min \left\{ b_{i1}, b_{i2}, \ldots, b_{ik} \right\} \hfill \\ \end{gathered}$$
(8)

where ai represents the mean distance from sample i to the other samples in the same category, bik denotes the mean distance from sample i to the samples in another category k, and bi is the smallest of these values.
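To make Eqs. (7) and (8) concrete, the following sketch evaluates both quantities on a four-point toy example. The function names and data are hypothetical; the score of a single-sample cluster is set to zero, which is a common convention assumed here rather than one stated in the text, and at least two clusters are required.

```python
import numpy as np

def sse(x, labels, centres):
    """Sum of squared errors between samples and their cluster centres (Eq. 7)."""
    return float(((x - centres[labels]) ** 2).sum())

def silhouette(x, labels):
    """Mean silhouette coefficient over all samples (Eq. 8)."""
    dist = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2))
    clusters = np.unique(labels)
    scores = np.zeros(len(x))
    for i in range(len(x)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # assumed convention: singleton clusters score 0
        # a_i: mean distance to the other samples of the same cluster.
        a = dist[i, same].sum() / (same.sum() - 1)
        # b_i: smallest mean distance to the samples of any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Two tight pairs of points, ten units apart (hypothetical data).
x = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
centres = np.array([[0.0, 0.5], [10.0, 0.5]])
total_sse = sse(x, labels, centres)
mean_silhouette = silhouette(x, labels)
print(total_sse, mean_silhouette)
```

A silhouette value close to 1 indicates compact, well-separated clusters, whereas values near 0 indicate overlapping clusters.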

2.4 Fuzzy C-means clustering algorithm

Early cluster analysis relied on hard clustering [18], in which samples are categorized into different classes with distinct boundaries. However, most objects do not possess explicit attributes; their characteristics and attributes are intermediate. For classifying such objects, the soft clustering method is acknowledged to be more suitable [2, 7]. To provide an analysis method for this type of soft cluster, Zadeh (1965) proposed the fuzzy set theory [46]. Subsequently, fuzzy cluster analysis, which can express the degree of uncertainty of objects and reflect the actual characteristics of samples, was introduced to solve the problem of soft clustering.

Dunn (1973) and Bezdek (1981) developed a fuzzy C-means algorithm based on the fuzzy set theory and K-means algorithm [2, 7]. Generally, objects in a database cannot be divided into distinct categories. However, using the fuzzy C-means algorithm, each object can be assigned a weight that indicates its degree of belongingness to each cluster. In this algorithm, a database is given by X = {X1, X2, …, XN} ⊂ Rp, where Rp represents a p-dimensional real-vector space, and uik is the membership function of sample k belonging to cluster i. The clustering loss function based on the membership function in the fuzzy C-means can be expressed as follows:

$$J(U,V) = \sum\limits_{k = 1}^{N} {\sum\limits_{i = 1}^{C} {(u_{ik} )^{m} (d_{ik} )^{2} } }$$
(9)

where 0 < uik ≤ 1; N represents the number of samples; C denotes the number of clusters; 1 ≤ k ≤ N; 1 ≤ i ≤ C; m is the weight index (m > 1); dik = ||xk − vi|| is the Euclidean distance between sample k and the centre of cluster i; and vi is the centre of cluster i. The fuzzy C-means clustering criterion seeks the minimum of the clustering loss function. The membership function (uik) and centre (vi) that minimize it can be expressed as follows:

$$u_{ik} = \left( {\sum\limits_{j = 1}^{C} {(\frac{{d_{ik} }}{{d_{jk} }})^{{\frac{2}{m - 1}}} } } \right)^{ - 1}$$
(10)
$$v_{i} = \frac{\sum\nolimits_{k = 1}^{N} (u_{ik})^{m} x_{k}}{\sum\nolimits_{k = 1}^{N} (u_{ik})^{m}}, \quad (i = 1, 2, \ldots, C), \quad (m > 1)$$
(11)

The fuzzy C-means algorithm involves four main steps. (1) Input the number of clusters (C), weight index (m), and initial membership matrix (U0). (2) Calculate the cluster centres according to Eq. (11). (3) Update the membership matrix according to Eq. (10). (4) Given ε > 0, if \(\max_{i,k} |u_{ik}^{(t)} - u_{ik}^{(t-1)}| < \varepsilon\), where t is the iteration index, then stop the algorithm and output the cluster centres and memberships; otherwise, return to step (2).
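The four steps can be sketched with NumPy as shown below. This is an illustrative sketch rather than the implementation used in the study: the function name, the random initialization of U0, the default values m = 2 and ε = 1e-5, the iteration cap, and the two-blob synthetic data are all assumptions.

```python
import numpy as np

def fuzzy_c_means(x, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Fuzzy C-means; returns (u, v): memberships (c, n) and centres (c, p)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: random initial membership matrix whose columns sum to 1.
    u = rng.random((c, x.shape[0]))
    u /= u.sum(axis=0)
    for _ in range(max_iter):
        # Step 2: cluster centres from Eq. (11).
        um = u ** m
        v = (um @ x) / um.sum(axis=1, keepdims=True)
        # Step 3: membership update from Eq. (10).
        d = np.sqrt(((x[None, :, :] - v[:, None, :]) ** 2).sum(axis=2))
        d = np.fmax(d, 1e-12)  # guard against division by zero
        inv = d ** (-2.0 / (m - 1.0))
        u_new = inv / inv.sum(axis=0)
        # Step 4: stop when the largest membership change falls below eps.
        converged = np.abs(u_new - u).max() < eps
        u = u_new
        if converged:
            break
    return u, v

# Two synthetic blobs (hypothetical data, not the case-study database).
rng = np.random.default_rng(2)
pts = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(10.0, 1.0, (20, 2))])
u, v = fuzzy_c_means(pts, c=2)
hard = u.argmax(axis=0)  # defuzzified labels, if a crisp assignment is needed
print(u[:, 0], hard[:3], hard[-3:])
```

Each column of `u` sums to one, so a sample lying between two clusters receives comparable memberships in both instead of being forced into one of them.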

3 Data preparation

3.1 Project review and data acquisition

The database used in this study was collected from a tunnel of the Guangzhou–Foshan intercity railway located in Guangzhou City, Guangdong Province, China. The geological profile of the construction site in the longitudinal tunnel direction is shown in Fig. 2. The total length of the tunnel studied herein was approximately 3.11 km. EPB shield machines were used to excavate the double-line tunnels. The cutterhead diameter of the EPB shield machine was 9.15 m. Each shield tunnel-lining ring, with outer and inner diameters of 8.8 and 8.0 m, respectively, was assembled from six segments. The width and thickness of a segment were 1.8 and 0.4 m, respectively. The database used in this study consists of shield parameters recorded by an acquisition system with sensors in the shield machine. To improve the analysis of the correlation between the geological features and the response of the shield machine, other relevant parameters calculated from the raw shield parameters were also adopted to establish the clustering model.

Fig. 2 Geological profile of construction site in longitudinal tunnel direction

3.2 Data pre-processing

The database includes 10 factors influencing the identification of GC. These factors (defined above) include six raw shield parameters (i.e. AR, F, T, CRS, UEP, and LEP) and four calculated parameters (i.e. PR, FPI, TPI, and SE). The six raw factors, which are direct responses of the shield machine and reflect changes in geological conditions, were recorded by the sensors in the shield machine. During excavation, the thrust force and cutterhead torque may be affected by the cutterhead diameter. To reduce the influence of cutterhead size, the thrust and cutterhead torque are transformed into the mean thrust force (MF) and mean cutterhead torque (MT), respectively, according to the following empirical equation [39]:

$$MF = F/A \quad \text{and} \quad MT = T/D^{3}$$
(12)

where A and D are the area and diameter of the cutterhead, respectively. Parameters PR, FPI, and TPI have also been introduced to evaluate the performance of shield machines and identify geological features [12, 52]; these parameters can be calculated as follows [40]:

$$\begin{gathered} PR = AR/CRS, \hfill \\ FPI = F/PR, \hfill \\ TPI = T/PR. \hfill \\ \end{gathered}$$
(13)

SE is the energy required to excavate a unit volume of soil or rock, and it changes according to geological conditions [30]. SE can be calculated from the four main shield parameters (i.e. AR, F, T, and CRS) and expressed as follows [41]:

$$SE = \frac{F}{A} + \frac{2\pi \cdot T \cdot CRS}{{A \cdot AR}}$$
(14)
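As a worked illustration of Eqs. (12)–(14), the derived parameters can be computed from one set of raw readings as sketched below. The function name and the input values are hypothetical, the cutterhead area is assumed to be that of a circle with diameter D, and the caller is assumed to supply readings in mutually consistent units.

```python
import math

def derived_parameters(ar, f, t, crs, diameter):
    """Return MF, MT, PR, FPI, TPI and SE from raw shield readings (Eqs. 12-14)."""
    area = math.pi * diameter ** 2 / 4.0  # assumed circular cutterhead face, A
    pr = ar / crs                         # penetration per cutterhead revolution
    return {
        "MF": f / area,
        "MT": t / diameter ** 3,
        "PR": pr,
        "FPI": f / pr,
        "TPI": t / pr,
        "SE": f / area + 2.0 * math.pi * t * crs / (area * ar),
    }

# Hypothetical readings: AR in m/min, F in kN, T in kN*m, CRS in rev/min, D in m.
vals = derived_parameters(ar=0.04, f=20000.0, t=4000.0, crs=1.5, diameter=9.15)
for name, value in vals.items():
    print(name, value)
```

Because PR = AR/CRS, the second term of SE equals 2πT/(A·PR), so SE grows as torque rises or penetration per revolution falls.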

3.3 Data correlation analysis

Correlation analysis shows the relationship among different variables, which gives a straightforward understanding of the data. Sensitivity analysis is a primary modelling tool for evaluating the correlation between input variables and objectives. It is valuable in construction management and resource allocation support [37]. This study adopts the Spearman correlation coefficient to show the relationship between input variables and GC. The Spearman correlation coefficient is calculated from the ranks (value grades) of the data rather than from the values themselves, so it describes monotonic relationships between the shield parameters and GC even when they are not linear. The Spearman correlation coefficient can be calculated as follows:

$$\rho_{s} = \frac{{\sum\nolimits_{i = 1}^{N} {\left( {R_{i} - \overline{R} } \right)\left( {S_{i} - \overline{S} } \right)} }}{{\left( {\sum\nolimits_{i = 1}^{N} {(R_{i} - \overline{R} )^{2} \sum\nolimits_{i = 1}^{N} {(S_{i} - \overline{S} )^{2} } } } \right)^{\frac{1}{2}} }}$$
(15)

where ρs is the Spearman correlation coefficient; Ri and Si are the ranks (value grades) of the ith measured values of the two factors; \(\overline{R}\) and \(\overline{S}\) are the mean ranks of the two factors, respectively; and N represents the number of measured data samples.
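Equation (15) can be sketched in plain Python as follows. The function names are hypothetical, and tied values are given their average rank, which is a common convention assumed here rather than one specified in the text.

```python
def average_ranks(values):
    """Rank values from 1 to N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (Eq. 15)."""
    r, s = average_ranks(x), average_ranks(y)
    n = len(r)
    r_mean, s_mean = sum(r) / n, sum(s) / n
    num = sum((ri - r_mean) * (si - s_mean) for ri, si in zip(r, s))
    den = (sum((ri - r_mean) ** 2 for ri in r)
           * sum((si - s_mean) ** 2 for si in s)) ** 0.5
    return num / den

# A monotonic but non-linear relationship still gives a coefficient of 1.
rho_monotonic = spearman([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])
rho_reversed = spearman([1, 2, 3, 4], [4, 3, 2, 1])
print(rho_monotonic, rho_reversed)
```

Because only the ordering of the values matters, a single extreme sensor reading affects this coefficient far less than it would affect a correlation computed on the raw values.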

Figure 3 shows the heatmap of the Spearman correlation coefficients for the variables used in clustering GC, which describes the correlation between variables in the database. PR and AR are the most critical factors for identifying GC; they are negatively correlated with geological features, with Spearman correlation coefficients of −0.74 and −0.69, respectively. During shield tunnelling, the soil and rock are excavated by the cutterhead. Soft soil is compressed as the shield advances, and thus more soil is drawn into the soil chamber by cutterhead rotation. Therefore, the shield machine should advance a greater distance to avoid collapse [1, 3]. Moreover, the CRS is kept at a low level to control the volume of soil entering the soil chamber and prevent overbreak. Consequently, soft geological features result in high penetration and advance rates. Additionally, less thrust force and cutterhead torque are required to excavate soft soil over the same advance distance owing to the low strength of soft ground. Hence, the FPI and TPI are low in soft GC [32, 45]. Moreover, SE is defined as the energy required to excavate a unit volume of soil or rock, and more energy is required to excavate hard rock than soft soil. Therefore, the comprehensive indexes, e.g. TPI, SE, and FPI, are positively correlated with geological features [8].

Fig. 3 Heatmap of Spearman correlation coefficient results for variables in clustering geological characteristics

3.4 Data features extraction using PCA

PCA provides an approach to compress data dimensions and improve the computational efficiency of the proposed models. In general, the control system of the shield machine records hundreds of parameters using sensors, all of which vary with changes in geological conditions. In the case study, ten factors highly related to geological features were selected for model establishment. Clustering on ten factors with a large volume of noisy data increases the time and difficulty of the clustering process. To improve the separability of data pertaining to geological features and save clustering time, data features were extracted by PCA. To determine the number of dimensions (k), the weight of each principal component was calculated from its eigenvalue. The weight and cumulative percentage of the principal components are shown in Fig. 4. The cumulative percentage of the weights of the first six principal components reached 99%; accordingly, the first six principal components were selected to analyse the geological characteristics, and the original 10-dimensional factors were transformed into 6-dimensional principal components.
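The selection of k from the cumulative weights can be sketched as below. The function name, the 99% default threshold expressed as 0.99, and the eigenvalues are hypothetical; the eigenvalues are chosen only so that six components are needed and are not the values behind Fig. 4.

```python
def choose_components(eigenvalues, threshold=0.99):
    """Smallest k whose leading eigenvalues reach the cumulative weight threshold."""
    total = sum(eigenvalues)
    running = 0.0
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(eigenvalues)

# Hypothetical eigenvalues of a 10 x 10 covariance matrix (they sum to 10).
eigenvalues = [4.0, 2.5, 1.5, 1.0, 0.6, 0.33, 0.04, 0.02, 0.005, 0.005]
k_99 = choose_components(eigenvalues)
k_95 = choose_components(eigenvalues, threshold=0.95)
print(k_99, k_95)
```

Lowering the threshold trades retained variance for fewer dimensions, so the threshold is a modelling choice rather than a fixed rule.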

Fig. 4 Weight and cumulative percentage of principal components

4 Detection and classification of geological characteristics

4.1 Pre-classification of geological characteristics

Before the GC were classified using the shield parameters, they were categorized into three types based on the rock and soil records obtained during the preliminary geological survey of the construction site and during shield tunnelling. The formations of shield excavation were labelled as follows: (i) formation with soft soil, (ii) formation with soft soil and hard rock, and (iii) formation with full-section hard rock. However, the classification of GC between two boreholes according to the geological survey was imprecise compared with the actual geological conditions. Additionally, the variation in geological conditions based on the soil and rock records during shield tunnelling was inaccurate because of observation errors. The proportions of different geological characteristics according to the preliminary survey of the entire construction site and 1197 records of shield tunnelling in two lines are shown in Fig. 5.

Fig. 5 Percentage of geological characteristics in (a) the entire construction site and (b) records of shield tunnelling

4.2 Classification of geological characteristics

4.2.1 K-means++ clustering model

Geological characteristics critically influence the shield machine parameters. Hence, the parameters can reflect variations in GC. Because geological conditions are usually continuous, the shield parameters aggregate into groups, and these characteristics can be clustered using K-means++ with shield parameters. To determine the number of clusters, the elbow method and silhouette coefficient were employed to select the values of K. The results of the elbow method and silhouette coefficient are shown in Fig. 6. Both the original 10-dimensional input data and the 6-dimensional data after dimension reduction were used to cluster GC. Elbow points were formed when K was 3 or 4. The degree of the elbow was quantified by the SSE ratio [28], which can be expressed as

$$SSE{\text{ ratio}} = \frac{{SSE_{K + 1} - SSE_{K} }}{{SSE_{K} - SSE_{K - 1} }}$$
(16)

where SSEK is the SSE of the K-means++ algorithm corresponding to K. Table 1 lists the SSE ratio values. The smaller the SSE ratio, the greater the elbow degree. The SSE ratio was smallest when K was 4, whereas the silhouette coefficient was largest when K was 3. Therefore, geological characteristics can be clustered into three or four types using K-means++. The clustered types of GC were then compared with the records obtained during shield tunnelling and the geological survey to validate the reliability of the K-means++ algorithm. Additionally, the SSE obtained using the 6-dimensional data is less than that obtained using the original 10-dimensional input data, and the silhouette coefficient obtained using the 6-dimensional data is larger. The comparison indicates that clustering with PCA data gives better results than clustering with the original data.
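The SSE ratio of Eq. (16) can be illustrated with a short sketch. The function name and the SSE values below are hypothetical and are not the values reported in Table 1; they only show how a sharp elbow produces a small ratio.

```python
def sse_ratios(sse_by_k):
    """Map each interior K to (SSE_{K+1} - SSE_K) / (SSE_K - SSE_{K-1})."""
    ratios = {}
    for k in sorted(sse_by_k):
        # The ratio needs the SSE at both neighbouring values of K.
        if k - 1 in sse_by_k and k + 1 in sse_by_k:
            ratios[k] = ((sse_by_k[k + 1] - sse_by_k[k])
                         / (sse_by_k[k] - sse_by_k[k - 1]))
    return ratios

# Hypothetical SSE curve with a pronounced elbow at K = 3.
sse_by_k = {1: 1000.0, 2: 600.0, 3: 300.0, 4: 250.0, 5: 220.0, 6: 200.0}
ratios = sse_ratios(sse_by_k)
sharpest_elbow = min(ratios, key=ratios.get)
print(ratios, sharpest_elbow)
```

A small ratio means the SSE drops steeply up to K and only slightly afterwards, which is why the smallest ratio marks the most pronounced elbow.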

Fig. 6 Values of K corresponding to sum of square errors and silhouette coefficient of K-means++

Table 1 SSE ratio of K-means++ algorithm

4.2.2 Fuzzy C-means clustering

The geological characteristics were classified into three and four types using the K-means++ algorithm. However, the GC were fuzzy during shield tunnelling. To improve the clustering accuracy, the fuzzy C-means algorithm using shield parameters was employed to identify GC. The fuzzy partition coefficient and silhouette coefficient were used to determine the number of GC types. The fuzzy partition coefficient is defined within the range 0–1, with 1 regarded as the best. This metric indicates how well the data are described by a specific clustering model. After the essential parameters of the fuzzy C-means were set, the database was clustered into two to eight clusters, and the fuzzy partition coefficient was calculated from each clustering result. The data are most clearly partitioned when the fuzzy partition coefficient is maximized. The fuzzy partition coefficient and silhouette coefficient of the fuzzy C-means with different values of K are shown in Fig. 7. To identify the types of geological features, the original and dimensionally reduced data were input into the fuzzy C-means clustering model. The fuzzy partition coefficient and silhouette coefficient of fuzzy C-means are highest when K is equal to 3. Therefore, the best option was to cluster the geological characteristics into three types. For comparison with the results of the K-means++ clustering model, four types of GC were also selected to establish the fuzzy C-means clustering model.
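The text does not state the formula for the fuzzy partition coefficient. The sketch below assumes the commonly used definition, the mean over samples of the summed squared memberships, which may differ in detail from the one used in the study; the function name and membership matrices are hypothetical.

```python
def fuzzy_partition_coefficient(u):
    """Mean over samples of the sum of squared memberships.

    u is a C x N membership matrix (one row per cluster) whose columns sum to 1.
    """
    n_samples = len(u[0])
    return sum(value * value for row in u for value in row) / n_samples

# A crisp partition scores 1; a completely uniform one scores 1 / C.
crisp = [[1.0, 0.0, 1.0],
         [0.0, 1.0, 0.0]]
uniform = [[0.25] * 5 for _ in range(4)]
fpc_crisp = fuzzy_partition_coefficient(crisp)
fpc_uniform = fuzzy_partition_coefficient(uniform)
print(fpc_crisp, fpc_uniform)
```

Under this definition the coefficient cannot fall below 1/C, so values obtained with different numbers of clusters are compared with that lower bound in mind.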

Fig. 7 Values of K corresponding to fuzzy partition coefficient and silhouette coefficient of fuzzy C-means

5 Results and discussions

5.1 Visualization and comparison of results

The TPI value indicates the ability of the ground to resist tunnel formation. The FPI value reveals the ability of earth or rock to resist the cutter under an external force. For the original data, the standardized TPI and FPI are used to plot the two-dimensional feature space of the clustering results; for the PCA data, the first two principal components are used to display and compare the results of the clustering models. The geological characteristics in the record during shield tunnelling are shown in Fig. 8. Clusters 1–3 correspond to the formations with soft soil, soft soil and hard rock, and hard rock, respectively. The geological features cannot be divided into three clusters with distinct boundaries using the original data with the standardized FPI and TPI, as shown in Fig. 8(a). However, the GC can be separated more clearly into different clusters with the dimensionally reduced data obtained using PCA (i.e. PCA data), as shown in Fig. 8(b); these data are therefore better suited to identifying geological characteristics. Geological features were recorded after shield tunnelling. Some geological characteristic points were mixed into other categories, indicating that the recorded characteristics did not correspond with the shield parameters because of special construction conditions (e.g. encountering boulders or formation reinforcement) recorded in the construction log.

Fig. 8 Geological characteristics in record after shield tunnelling using (a) original and (b) PCA data

The clustering results for three types of geological characteristics using K-means++ and fuzzy C-means with the original and PCA data are visualized in Fig. 9. The results for the three types of GC differ between the K-means++ and fuzzy C-means clustering models (e.g. samples A and B in Fig. 9(a) and (c), using original data; samples C and D in Fig. 9(b) and (d), using PCA data). The results for the four types of GC shown in Fig. 10 are compared with those for the three types of GC. Clusters 1–4 correspond to formations with soft soil, mainly soft soil, mostly hard rock, and hard rock, respectively. Sample A in Fig. 10(a) and (c), and sample B in Fig. 10(b) and (d) are also classified into different clusters by the two models. Additionally, GC can be identified more easily in the clustering models using PCA data, as shown in Figs. 9 and 10. The silhouette coefficients of clustering models using PCA data were greater than those obtained using the original data (Table 2). Therefore, the results of clustering models with PCA data can be better employed to analyse the various types of GC ahead of the cutterhead during shield tunnelling.

Fig. 9 Visualization of three types of geological characteristics using (a) K-means++ with original data, (b) K-means++ with PCA data, (c) fuzzy C-means with original data, and (d) fuzzy C-means with PCA data

Fig. 10 Visualization of four types of geological characteristics using (a) K-means++ with original data, (b) K-means++ with PCA data, (c) fuzzy C-means with original data, and (d) fuzzy C-means with PCA data

Table 2 Silhouette coefficient of clustering models

5.2 Analysis of clustering model results

To analyse the different results of clustering models during shield tunnelling, the geological profile and the manually recorded characteristics are compared in Fig. 11 with the results for three and four types of GC obtained using K-means++ and fuzzy C-means with PCA data. Compared with the geological features in the record, GC in the formation containing a mix of soft soil and hard rock after the 250th lining ring did not correspond to the geological profile. This further indicates that the profile drawn by connecting different boreholes was inaccurate. Variations among the results of clustering models exist in five areas: areas A–E. The accuracy of the three-type results of K-means++ and fuzzy C-means was evaluated against the recorded GC. As summarized in Table 3, the identification rate exceeds 90%. As listed in Table 4, the GC in the 58th lining ring in the west line (W58, sample c1 in Fig. 11(c)) indicate that moderately weathered granite was encountered. This ring was better classified into cluster 3, with a membership of 0.42, than into cluster 1 (membership: 0.36) or cluster 2 (membership: 0.22). Additionally, the GC in the W167 and W176 lining rings (sample a1 in Fig. 11a and samples c2, c3 in Fig. 11c) were classified into cluster 3 with membership values of 0.75 and 0.96, respectively, because of the large volume of rock, as recorded in the construction log. However, they were identified as clusters 1 and 2 in K-means++. The GC of the W59, W256, and W596 lining rings were incorrectly classified into cluster 1 in the K-means++ model. However, they were recognized as cluster 2 with membership values of 0.64, 0.96, and 0.99, respectively, in the fuzzy C-means. This illustrates that the fuzzy C-means model can identify GC better than the K-means++ model.

Regarding the unexpected GC in the 77th lining ring in the east line (E77), the formation of soft soil including boulders was recorded as cluster 1 and identified as clusters 2 and 3 in K-means++ and fuzzy C-means, respectively. The membership value of 0.54 in cluster 3 reflects the excavation difficulty experienced by the shield machine. Although the GC of the E77 lining ring are a formation with soft soil, classifying the ring into cluster 3 in the fuzzy C-means is reasonable. Additionally, the GC of the W454 lining ring, which includes small boulders and was recorded as a formation with soft soil, were also classified into cluster 2 (soft soil and hard rock) in the clustering models.

Fig. 11 Geological profile and geological characteristics in west line: (a) three types in record; (b) three types in K-means++; (c) three types in fuzzy C-means; (d) four types in K-means++; and (e) four types in fuzzy C-means

Table 3 Results of clustering models vs. geological characteristics in the record
Table 4 Results of three types of geological characteristics in clustering models

Compared with the three types of GC, the major difference in the four-type results of the clustering models was the proportion of hard rock. The formation with soft soil and hard rock (cluster 2) in the three types of GC was divided, in the four types of GC, into a formation consisting mainly of soft soil (cluster 2) and a formation consisting mainly of hard rock (cluster 3). The GC of the W308 and W309 lining rings (sample b1 in Fig. 11b and sample c4 in Fig. 11c, respectively), which include small pieces of hard rock and were recorded as a formation with soft soil and hard rock, were classified into cluster 1 in the clustering models with three types of GC. However, they were identified as cluster 2 in the clustering models with four types of GC, which shows the fuzzy boundary of the clustering models with three types of GC and the high sensitivity of those with four types of GC. Furthermore, the GC in area b2 in Fig. 11b and area c5 in Fig. 11c were identified as cluster 2 of the four types of GC in the clustering models. Additionally, the GC of the W370 lining ring (sample d1 in Fig. 11d and sample e1 in Fig. 11e) were classified into cluster 3 of the four types of GC rather than cluster 2 of the three types of GC, indicating that the formation in the W370 lining ring mainly consists of hard rock. In general, the geological features were recorded by site engineers based on the soil or rock conveyed out. However, soil and rock are mixed in the soil chamber, and the material from the previous lining ring has not been completely conveyed out of the chamber when the geological features change, which can lead to misidentification of GC in the manual record. Although no widely accepted expressions for shield parameters exist for various strata, there is a mechanical relationship between shield parameters (e.g. F and T) and GC [39, 52]. Thus, the variation of shield parameters is the response of the shield machine to changes in GC [23, 24, 45].
The fuzzy C-means model can accurately identify changes in GC from the response of the shield machine rather than from engineers' experience. Therefore, most of the inconsistencies between the model results and the manually recorded geological features occur where the recorded geological features change.
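
To make concrete how a membership value arises from the shield response, the following sketch computes the standard fuzzy C-means membership update for fixed cluster centres. It is a generic textbook form, not the study's code: the function name `fcm_memberships`, the Euclidean distance, and the fuzzifier `m = 2` are assumptions chosen for illustration.

```python
import numpy as np

def fcm_memberships(points, centers, m=2.0):
    """Standard fuzzy C-means membership update for fixed cluster centres.

    u[i, k] = d[i, k]^(-2/(m-1)) / sum_j d[i, j]^(-2/(m-1)),
    where d[i, k] is the distance from point i to centre k.
    """
    x = np.asarray(points, dtype=float)
    c = np.asarray(centers, dtype=float)
    # Pairwise Euclidean distances, shape (n_points, n_clusters).
    d = np.linalg.norm(x[:, None, :] - c[None, :, :], axis=2)
    d = np.fmax(d, 1e-12)  # avoid division by zero when a point sits on a centre
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

# Hypothetical 2-D feature point and two hypothetical centres (not study data).
u = fcm_memberships([[1.0, 0.0]], [[0.0, 0.0], [4.0, 0.0]])
```

Each row of the result sums to 1, so a sample lying between two centres receives intermediate memberships instead of being forced into one cluster.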

To compare the results of the different clustering models, the four types of GC between lining rings 420 and 600 on the west line are shown in Fig. 12. Table 5 summarizes the differences among the four-type results of the clustering models. The GC in the W429, E462, and E581 lining rings were classified into cluster 3 with membership values of 0.58, 0.51, and 0.52 in the fuzzy C-means, respectively. The corresponding membership values in cluster 2 are 0.40, 0.46, and 0.43, lower than those in cluster 3, which indicates that the proportion of hard rock exceeds that of soft soil in these lining rings. Moreover, the formations of W264, W453, and W491 were reinforced by grouting from the ground surface. They were clearly classified into cluster 3 with membership values of 0.967, 0.90, and 0.93, respectively.
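
One simple way to see how close these rings are to the boundary between clusters 2 and 3 is the difference between the two largest memberships. The sketch below computes it from the values quoted above. The margin itself is an illustrative measure introduced here, not one used in the study, and `top_two_margin` is a hypothetical helper name.

```python
def top_two_margin(memberships):
    """Difference between the largest and second-largest membership values."""
    ordered = sorted(memberships, reverse=True)
    return ordered[0] - ordered[1]

# (cluster 3, cluster 2) memberships quoted in the text. Memberships in the
# other clusters are not reported, so only these two values are compared.
quoted = {"W429": (0.58, 0.40), "E462": (0.51, 0.46), "E581": (0.52, 0.43)}
margins = {ring: round(top_two_margin(u), 2) for ring, u in quoted.items()}
```

A small margin, as for E462, marks a ring whose assignment to cluster 3 is only slightly favoured over cluster 2.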

Fig. 12 Four types of geological characteristics between rings 420 and 600 in west line: a K-means++ and b fuzzy C-means

Table 5 Results of four types of geological characteristics in clustering models

5.3 Discussion

5.3.1 Analysis of shield parameters and geological characteristics

It is well established that shield parameters vary with the geological features during shield tunnelling [24]. Researchers have proposed empirical and semi-empirical formulas based on field data and numerical simulation [25, 27]. However, no widely accepted expressions for shield parameters exist for various strata [39]. The calculation of a single shield parameter (e.g. F and T) in different strata is a nonlinear problem [52]. Therefore, this study adopted the comprehensive indexes (FPI, TPI, and SE) to estimate the geological conditions [44]. Recently, some researchers have proposed prediction models to forecast shield parameters and evaluate rock mass classes during shield tunnelling based on machine learning technology and field monitoring data [32, 49, 50, 54]. However, recent studies classified the shield parameters into several categories without analysing the correlation between actual geological features and shield parameters [23, 45, 56]. To reveal the relationship between comprehensive indexes and GC, geometric and geological parameters were obtained via laboratory tests on soil samples extracted from boreholes [26]. Eight tunnel sections with different soil–rock ratios corresponding to the borehole locations were selected. Then, the geological parameters were calculated based on the soil–rock ratios and the suggested values in the geological survey report. Finally, the capacity eigenvalue (CE) of the strata was selected to represent GC based on statistical analysis [23]. The relationship between the comprehensive indexes and CE is shown in Fig. 13. The soil–rock ratios of the tunnel sections and the fitted curves are also presented. In this case study, the shield tunnel mainly passes through moderately, intensely, and completely weathered granite as well as silty clay. The suggested CEs of moderately, intensely, and completely weathered granite and silty clay are 1500, 500, 200, and 180 kPa, respectively.
Moderately weathered granite is considered hard rock, whereas completely weathered granite and silty clay represent soft soil. The strength of intensely weathered granite was slightly higher than that of completely weathered granite. The fitted curves and 95% confidence intervals between FPI, TPI, SE, and CE are presented in Fig. 13 and show a strong positive correlation. The coefficients of determination (R²) of the fitted curves for FPI, TPI, and SE are 0.98, 0.91, and 0.91, respectively. Hence, according to the correlation analysis, the FPI, TPI, and SE reflect the GC. Therefore, the identification of GC using shield parameters and comprehensive indexes is reasonable [56].
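
As an illustration of how a section-level CE might be derived from the soil–rock ratios and the suggested values, the sketch below computes a proportion-weighted average. The weighted-average form is an assumption, since the text states only that the parameters were calculated from the ratios and suggested values. The function name `average_ce`, the stratum keys, and the example face composition are likewise hypothetical.

```python
# Suggested capacity eigenvalues (kPa) quoted in the text.
SUGGESTED_CE_KPA = {
    "moderately_weathered_granite": 1500.0,
    "intensely_weathered_granite": 500.0,
    "completely_weathered_granite": 200.0,
    "silty_clay": 180.0,
}

def average_ce(proportions):
    """Proportion-weighted mean CE (kPa) of a tunnel section.

    `proportions` maps a stratum name to its share of the section;
    the shares must sum to 1.
    """
    total = sum(proportions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("stratum proportions must sum to 1")
    return sum(SUGGESTED_CE_KPA[name] * share
               for name, share in proportions.items())

# Hypothetical section: half hard rock, half silty clay (not a section
# from the case study).
example_ce = average_ce({"moderately_weathered_granite": 0.5,
                         "silty_clay": 0.5})
```

Under this assumption, the section CE rises monotonically with the proportion of hard rock, which is consistent with using CE as the horizontal axis for the fitted curves.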

Fig. 13 Relationship between comprehensive indexes and CE of soils/rocks: a FPI vs. CE, b TPI vs. CE, and c SE vs. CE

5.3.2 Application of clustering model

Geological features critically influence the construction efficiency and energy consumption during shield tunnelling [54]. An increase in the proportion of hard rock can result in cutterhead wear. The four types of GC classified by the fuzzy C-means come with membership values that indicate the degree of association with each type of GC, providing better guidance for shield parameter setting and shield maintenance. The relationship between the GC and the efficiency of shield tunnelling is illustrated in Fig. 14. The slope of the lining-ring progress line reflects the shield tunnelling efficiency: within a given period, the completion of more lining rings implies a higher efficiency. As shown in Fig. 14, shield tunnelling is most efficient in the formation with soft soil and least efficient in the formation with hard rock.
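
The slope-based reading of efficiency can be sketched as a least-squares fit of cumulative lining rings against elapsed time. The progress records below are hypothetical and are not data from the case study; the time unit (days) and the function name `rings_per_day` are also assumptions.

```python
import numpy as np

def rings_per_day(days, cumulative_rings):
    """Least-squares slope of cumulative lining rings against elapsed days."""
    t = np.asarray(days, dtype=float)
    n = np.asarray(cumulative_rings, dtype=float)
    slope, _intercept = np.polyfit(t, n, 1)
    return float(slope)

# Hypothetical progress records for two formations (illustrative only).
soft_soil_rate = rings_per_day([0, 1, 2, 3], [0, 8, 16, 24])
hard_rock_rate = rings_per_day([0, 1, 2, 3], [0, 2, 4, 6])
```

A steeper slope corresponds to more rings completed in the same period, that is, to a higher tunnelling efficiency.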

Fig. 14 Relationship between geological characteristics and efficiency of shield tunnelling

Identification of GC is essential for shield construction management. The proposed fuzzy C-means model can be used in other geological conditions for shield tunnelling. However, it should be noted that geological features (e.g. rock strength and capacity eigenvalue) vary substantially between regions [26]. The fuzzy C-means model can be employed to identify GC within the same region based on the established correlation in the database. An extensive database of the relationship between input parameters and GC in various regions is expected to be established [55]. The fuzzy C-means model should be trained and validated for each new shield tunnelling project before it is applied to determine GC during shield tunnelling. When a shield machine encounters a formation with soft soil, the operator can adjust the shield parameters to accelerate shield tunnelling [56]. The proportion of hard rock has a crucial influence on disc cutter wear and energy consumption. When a formation with hard rock is identified during shield tunnelling, the site engineers can schedule a proper time for changing the cutters. New cutters cut hard rock more effectively, which saves energy and supports global energy conservation and emission reduction. Additionally, the identification of GC can be employed to monitor various formations and provide early warning to ensure the safety of shield tunnelling.

6 Conclusions

A fuzzy clustering model is proposed in this study to identify geological characteristics of mixed rock–soil strata during shield tunnelling. Based on borehole data and operational parameters, the geological features can be determined accurately by the proposed fuzzy C-means model, which integrates K-means and fuzzy sets. The following conclusions can be drawn:

  1) The low-dimensional PCA data, compressed from the original data by principal component analysis (PCA), are better suited to identifying geological characteristics. The sum of squared errors ratio (SSE ratio), silhouette coefficient, and fuzzy partition coefficient of the clustering models were improved using the PCA data. In addition, the boundaries in the feature space drawn using the PCA data were more precise than those drawn using the original data.

  2) The correlation analysis results indicate that the torque penetration index (TPI), field penetration index (FPI), and specific energy (SE) were strongly positively correlated with geological characteristics. The average capacity eigenvalue (CE) was selected to reflect geological features. The coefficients of determination (R²) of the fitted curves of FPI, TPI, and SE against CE are 0.98, 0.91, and 0.91, respectively, so these indexes can be used to show the distribution of geological characteristics in the feature space.

  3) The geological characteristics could be classified into four types in the fuzzy C-means clustering model based on the elbow method, silhouette coefficient, fuzzy partition coefficient, and the geological profile indicated by the boreholes. The proposed fuzzy C-means model provides membership values that reveal the proportion of soil and rock, and it performs better than hard clustering models such as K-means++.

  4) It should be noted that the proposed fuzzy C-means model depends on extensive data from a region to identify geological characteristics. The proposed model should be retrained and validated before it is applied in another area because the geological features vary significantly from region to region. Additionally, misclassification may occur when the shield encounters boulders.