1 Introduction

With the rapid development of urban rail transportation, ballastless track has been widely adopted on metro lines because it requires little periodic maintenance and offers greater vehicle stability [1]. For metro shield tunnels, the track structure consists of the rail, fasteners, track slab, subgrade and tunnel lining. The track slab and subgrade are usually constructed as monolithic cast-in-place concrete structures on the tunnel lining. However, under continuous cyclic train loads, cracks often occur at the interface between the subgrade and the tunnel lining near the expansion joints of the track slab [2]. Over time, this type of crack propagates further along the cross-sectional interface, ultimately causing damage to other vehicle components or track structures [3]. By the time such damage can be detected visually, the track structure is already unable to deliver normal service performance, which endangers train operation. Therefore, health condition monitoring and damage detection of rail track structures, especially early-stage damage identification, are critically important [4,5,6].

Because the vibration properties (e.g., the natural frequency, amplitude, and damping ratio) of a structural system vary with its mechanical parameters and boundary conditions, many researchers have attempted to use variations in vibration properties to detect the structural health condition [7,8,9,10]. Commonly used methods for vibration signal analysis include empirical mode decomposition (EMD) [11, 12], the short-time Fourier transform (STFT) [13, 14] and the wavelet transform (WT) [15, 16]. In general, traditional vibration signal-based health condition evaluation in the frequency domain requires pre-processing of the raw signals: complex mappings and transformations are performed to filter out noise and redundant information so as to highlight the information related to the damage. Finally, eigenvalues or feature vectors that best characterize the damage are extracted.

However, noise is a challenging issue to deal with. In practical engineering, wheel–rail impact vibration arises from multiple excitation sources with time-varying and random characteristics, which means that the vibration signals contain substantial environmental noise. This strong environmental noise tends to mask the subtle variations that damage causes in the vibration signals [7, 17]. For such complex signals, traditional signal processing methods suffer from low efficiency and poor generality. To address these issues, researchers have developed signal visualization methods that convert the vibration signals into two-dimensional grayscale images for analysis. The image features can then be extracted directly for damage detection and recognition without noise reduction [18,19,20]. These analysis methods have demonstrated better robustness in strong-noise environments [21, 22]. Specifically, Do et al. [23] used the scale invariant feature transform (SIFT) algorithm to extract local feature vectors of 2D images generated from vibration signals for the detection and diagnosis of asynchronous motor faults. Zheng et al. [24] proposed a novel FAST-Unoriented-SIFT algorithm for extracting planetary gear fault feature values, which was more efficient and extracted more features than the SIFT algorithm. Meanwhile, bag-based representations have been widely used to compute the similarity between digital objects by characterizing the frequency of occurrence of object features [25]. Among them, Visual Bag-of-Words (VBoW) models have been used effectively for image feature clustering and classification tasks [26]. Qi et al. [27] used a VBoW model to extract surface features of soil and efficiently characterize textural information. Yang et al. [28] extracted the vibration signal features of rotating machinery with the VBoW model for detection and achieved good results. Zheng et al. [29] found that the VBoW model had good recognition efficiency and accuracy when trained on small datasets of planetary gear fault features.

Inspired by the aforementioned research, this work proposes a vibration-based and computer vision-aided health condition evaluation method for rail track structures. Unlike traditional methods, this method avoids the noise reduction process, directly extracts numerous redundant features, and uses the redundancy of the features to achieve a more robust signal description. The rest of the paper is organized as follows. Section 2 presents the methodology used in this work. It consists of three parts: grayscale image processing of rail vibration signals, image feature extraction, and the establishment of the VBoW model with optimal parameters. Section 3 describes the vibration signal acquisition during the field test. Section 4 analyzes and discusses the experimental results. Specifically, the optimal parameters are first obtained from the training dataset by fourfold cross-validation. Then, the reliability and real-time performance of the proposed method are verified using the testing dataset. Finally, the advantages of the method are highlighted by a performance comparison. Concluding remarks are given in Sect. 5. By employing the vibration signal visualization, encoding and classification methods, the feature information of different track structural health conditions can be effectively identified.

2 Methodology

2.1 Vibration signal visualization

This section introduces the method used to convert the raw vibration signals into the pixels of a grayscale image [30]. First, the amplitude of the one-dimensional time-domain vibration signal s is normalized to the range [1, 255], and the normalized signal \({\text{s}}_{norm}\) is obtained according to Eq. (1).

$${\text{s}}_{{{\text{norm}}}} = \frac{127}{{\max (\left| {\text{s}} \right|)}} \cdot {\text{s}} + 128$$
(1)

where \({\text{s}} = [s_{1} ,s_{2} , \cdots ,s_{n} ]\) is the raw vibration signal, n represents the length of the signal, and \(s_{n}\) is the magnitude of the n-th value of the signal.

Then, each value of the normalized signal \({\mathbf{s}}_{norm}\) is mapped in sequence to one pixel of the two-dimensional grayscale image G as its grayscale value. The grayscale value of each pixel of G is given by Eq. (2).

$${\text{G}}(x,y) = {\text{s}}_{{{\text{norm}}}} (N \cdot (x - 1) + y),x \in \left\{ {1,2, \ldots ,M} \right\},y \in \left\{ {1,2, \ldots ,N} \right\}$$
(2)

where G(x, y) denotes the grayscale value of the pixel in the x-th row and y-th column of the grayscale image G. \({\text{s}}_{{{\text{norm}}}} (i)\) denotes the i-th value of \({\mathbf{s}}_{{{\text{norm}}}}\). The obtained grayscale image G has M × N pixel points (i.e., M and N represent the numbers of the rows and columns of the grayscale image, respectively). The conversion method is shown in Fig. 1.
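As a rough illustration of Eqs. (1) and (2), the conversion can be sketched in Python with NumPy. The function name `signal_to_grayscale`, the rounding to 8-bit integers and the truncation of any samples beyond M × N are assumptions made for this sketch and are not specified in the text.

```python
import numpy as np

def signal_to_grayscale(s, rows, cols):
    """Map a 1-D signal to a rows x cols uint8 image following Eqs. (1) and (2).

    rows and cols play the roles of M and N. Rounding to the nearest integer
    is an assumption of this sketch.
    """
    s = np.asarray(s, dtype=float)
    if s.size < rows * cols:
        raise ValueError("signal is shorter than rows * cols")
    peak = np.max(np.abs(s))
    if peak == 0:
        raise ValueError("signal is identically zero; Eq. (1) is undefined")
    # Eq. (1): scale to [-127, 127] and shift by 128, giving values in [1, 255].
    s_norm = 127.0 / peak * s + 128.0
    # Eq. (2): row-major fill, G(x, y) = s_norm(N * (x - 1) + y) with 1-based x, y.
    g = s_norm[: rows * cols].reshape(rows, cols)
    return np.rint(g).astype(np.uint8)


if __name__ == "__main__":
    toy = [-2.0, -1.0, 0.0, 1.0, 2.0, 1.0]   # made-up samples, not measured data
    print(signal_to_grayscale(toy, rows=2, cols=3))
```

The row-major `reshape` reproduces the index N·(x − 1) + y of Eq. (2) once the 1-based indices are shifted to NumPy's 0-based convention.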

Fig. 1 Conversion of vibration signal to grayscale image

2.2 Image feature extraction

The image feature extraction method used in this work combines the rapid feature point detection of the Features from Accelerated Segment Test (FAST) algorithm [31] with the feature description capability of the Unoriented Scale Invariant Feature Transform (SIFT) algorithm [32]. The FAST-Unoriented-SIFT algorithm has shown good performance and robustness in noisy environments [24].

2.2.1 FAST feature point detection

Pixels whose grayscale values differ markedly from those of the surrounding pixels are taken as feature points [31]. The status \(S_{p \to q}\) of the center pixel p with respect to pixel q is calculated according to Eq. (3), where p is a candidate feature point and q is any pixel on a circle of radius r around p, as shown in Fig. 2.

$$S_{p \to q} = \begin{cases} d_{q} , & G_{q} \le G_{p} - t_{p} \\ s_{q} , & G_{p} - t_{p} < G_{q} < G_{p} + t_{p} \\ b_{q} , & G_{p} + t_{p} \le G_{q} \end{cases}$$
(3)

Fig. 2 A diagram of FAST feature point detection

where G is the grayscale value of a pixel, \(n_{c}\) is the count threshold of the corner detector on this circle and \(t_{p}\) is the threshold value. The status \(d_{q}\) or \(b_{q}\) indicates that the pixel on the circle is darker or brighter than the center pixel, respectively, and \(s_{q}\) indicates that it is similar to the center pixel. In other words, if the number of pixels with status \(d_{q}\) or \(b_{q}\) exceeds \(n_{c}\), the center pixel p of the circle is considered a feature point.

In this work, r = 4 and nc = 12 were taken according to Ref. [31]. The threshold value tp determines the number of extracted feature points, and the related analysis is shown in Sect. 4.1.1.
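The status test of Eq. (3) and the corner decision can be sketched as follows. Several choices here are assumptions rather than details given in the text: the 16-pixel circle of radius 3 is the one used by the original FAST detector (the pixel layout for r = 4 is not given above), and the decision rule requires at least \(n_{c}\) contiguous pixels of the same status, as in the original FAST formulation, whereas the description above only states a count. All function names are hypothetical.

```python
import numpy as np

# 16-pixel Bresenham circle of radius 3 as (row, col) offsets, listed in ring order.
# Assumption: this is the circle of the original FAST detector, used for illustration.
CIRCLE = [(-3, 0), (-3, 1), (-2, 2), (-1, 3), (0, 3), (1, 3), (2, 2), (3, 1),
          (3, 0), (3, -1), (2, -2), (1, -3), (0, -3), (-1, -3), (-2, -2), (-3, -1)]

def pixel_status(g_p, g_q, t):
    """Eq. (3): 'd' if q is darker than p, 'b' if brighter, 's' if similar."""
    if g_q <= g_p - t:
        return "d"
    if g_q >= g_p + t:
        return "b"
    return "s"

def longest_circular_run(states, target):
    """Length of the longest contiguous run of `target` on a closed ring."""
    n = len(states)
    best = run = 0
    for s in states + states:          # doubling the list handles wrap-around
        run = run + 1 if s == target else 0
        best = max(best, run)
    return min(best, n)

def is_feature_point(img, row, col, t, n_c=12):
    """Segment test: True if at least n_c contiguous ring pixels are all darker
    or all brighter than the center pixel (contiguity is an assumption)."""
    g_p = int(img[row, col])
    states = [pixel_status(g_p, int(img[row + dr, col + dc]), t) for dr, dc in CIRCLE]
    return max(longest_circular_run(states, "d"),
               longest_circular_run(states, "b")) >= n_c
```

A smaller threshold t marks more ring pixels as darker or brighter, so more candidates pass the test, which is consistent with the statement that \(t_{p}\) controls the number of extracted feature points.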

2.2.2 Unoriented-SIFT feature description

After the feature points are detected, a 16 × 16 window around each feature point p is taken, and the gradient modulus and direction of each pixel within this window are calculated according to Eqs. (4) and (5).

$${\text{m}}(x,y) = \sqrt {({\text{G}}(x + 1,y) - {\text{G}}(x - 1,y))^{2} + ({\text{G}}(x,y + 1) - {\text{G}}(x,y - 1))^{2} }$$
(4)
$${\uptheta }(x,y) = \arctan \frac{{{\text{G}}(x,y + 1) - {\text{G}}(x,y - 1)}}{{{\text{G}}(x + 1,y) - {\text{G}}(x - 1,y)}}$$
(5)

where \({\text{m}}(x,y)\) and \({\uptheta }(x,y)\) denote the modulus and direction of the gradient, respectively.

After that, the window is divided into 16 cells of 4 × 4 pixels, and the gradient moduli in each cell are accumulated over 8 directions (45 degrees per direction). An 8-dimensional description vector is thus obtained for each cell, and a 128-dimensional (16 × 8) description vector \({\text{d}}\) is finally obtained for the whole window, which is the Unoriented-SIFT description vector of feature point p. The process of Unoriented-SIFT feature description is shown in Fig. 3, where the direction of each arrow represents the gradient direction of the pixel and the length of the arrow represents the modulus of the gradient.
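A simplified sketch of this descriptor is given below. It follows Eqs. (4) and (5) and the 4 × 4 cell layout, but it is not the full SIFT descriptor: Gaussian weighting, interpolation between bins and vector normalization are omitted. Using the four-quadrant `arctan2` to cover all 8 directions, the placement of the window relative to the feature point and the function name are assumptions of this sketch.

```python
import numpy as np

def unoriented_sift_descriptor(img, row, col):
    """128-D descriptor from the 16 x 16 window starting at (row - 8, col - 8).

    Simplified sketch: plain magnitude-weighted orientation histograms,
    8 bins of 45 degrees in each of the 16 cells of 4 x 4 pixels.
    """
    g = np.asarray(img, dtype=float)
    r0, c0 = row - 8, col - 8
    # Central differences need a one-pixel border around the window.
    if r0 < 1 or c0 < 1 or r0 + 16 > g.shape[0] - 1 or c0 + 16 > g.shape[1] - 1:
        raise ValueError("window plus a one-pixel border must lie inside the image")
    win = g[r0 - 1 : r0 + 17, c0 - 1 : c0 + 17]        # 18 x 18 patch
    dx = win[2:, 1:-1] - win[:-2, 1:-1]                # G(x+1, y) - G(x-1, y)
    dy = win[1:-1, 2:] - win[1:-1, :-2]                # G(x, y+1) - G(x, y-1)
    mag = np.hypot(dx, dy)                             # Eq. (4)
    theta = np.arctan2(dy, dx)                         # Eq. (5), four-quadrant form
    bins = np.floor((theta % (2 * np.pi)) / (np.pi / 4)).astype(int) % 8
    desc = np.zeros((4, 4, 8))
    for i in range(16):
        for j in range(16):
            desc[i // 4, j // 4, bins[i, j]] += mag[i, j]
    return desc.ravel()                                # 16 cells x 8 bins = 128
```

Because no dominant orientation is estimated and the window is not rotated, the descriptor is "unoriented", which is the property the method name refers to.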

Fig. 3 The process of Unoriented-SIFT feature description

2.3 VBoW modeling

In the VBoW model, each image feature (i.e., the 128-dimensional description vector \({\text{d}}\) mentioned in Sect. 2.2.2) is quantized as one word, and each grayscale image is viewed as a bag of words. Similar image features are then clustered into one class by the k-means algorithm, and the vector of each cluster center is quantized as one keyword. Therefore, the VBoW modeling process includes keyword acquisition and grayscale image representation.

2.3.1 Keywords acquisition

By assuming that N image features \({\text{D}}_{N} { = }\left\{ {{\text{d}}_{1} ,{\text{d}}_{2} , \ldots ,{\text{d}}_{N} } \right\}\) are extracted from one grayscale image, the k-means algorithm eventually partitions \({\text{D}}_{N}\) into \(\kappa\) clusters \(\left\{ {{\text{C}}_{i} |i = 1,2, \ldots ,\kappa } \right\}\) by minimizing the squared error according to Eq. (6).

$$e = \sum\limits_{i = 1}^{\kappa } {\sum\limits_{{{\text{d}}_{j} \in {\text{C}}_{i} }} {\left\| {{\text{d}}_{j} - {\upmu }_{i} } \right\|^{2} } }$$
(6)

where the clusters are mutually disjoint, i.e., \({\text{C}}_{{i^{\prime}}} \cap {\text{C}}_{{i^{\prime\prime}}} = \emptyset\) for \(i^{\prime\prime} \ne i^{\prime}\), and \({\text{D}}_{N} = \cup_{i = 1}^{\kappa } {\text{C}}_{i}\). \({\text{d}}_{j}\) denotes one image feature in \({\text{C}}_{i}\), and \({\upmu }_{i}\) denotes the center of cluster \({\text{C}}_{i}\). The smaller the value of e, the better the clustering result. The calculation process is shown in Algorithm 1.
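For orientation, the minimization of Eq. (6) can be sketched with the standard Lloyd iteration for k-means. This is a generic textbook version and may differ in detail from Algorithm 1. The random initialization, the stopping rule and the handling of empty clusters are assumptions of this sketch.

```python
import numpy as np

def kmeans(features, k, n_iter=100, seed=0):
    """Lloyd's k-means. Returns (centers, labels, squared_error of Eq. (6))."""
    d = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: initial cluster centers are k distinct features chosen at random.
    centers = d[rng.choice(len(d), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance of every feature to every center, shape (N, k).
        dist2 = ((d[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist2.argmin(axis=1)
        new_centers = centers.copy()
        for i in range(k):
            members = d[labels == i]
            if len(members):            # keep the old center if a cluster empties
                new_centers[i] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dist2 = ((d[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dist2.argmin(axis=1)
    e = dist2[np.arange(len(d)), labels].sum()   # Eq. (6)
    return centers, labels, e
```

The rows of the returned `centers` array correspond to the keywords, and `e` is the squared error that the clustering seeks to minimize.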

Algorithm 1

2.3.2 Grayscale image representation

After keyword acquisition, the frequency of each keyword in each grayscale image is counted to obtain the word frequency vector of that image. In the training set, a codebook vector is obtained by averaging the word frequency vectors of all grayscale images belonging to the same health condition, i.e., each codebook vector describes one health condition of the track structure. For each grayscale image in the testing set, the Euclidean distances between its word frequency vector and the codebook vectors of the three health conditions are calculated, and the image is assigned to the health condition whose codebook vector is nearest. The process is shown in Fig. 4.
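The representation and classification steps can be sketched as follows. The function names are hypothetical, and the use of raw counts (rather than normalized frequencies) for the word frequency vector is an assumption, since the text does not state whether the histogram is normalized.

```python
import numpy as np

def word_frequency(descriptors, keywords):
    """Count how many descriptors of one image fall nearest to each keyword."""
    d = np.asarray(descriptors, dtype=float)
    k = np.asarray(keywords, dtype=float)
    dist2 = ((d[:, None, :] - k[None, :, :]) ** 2).sum(axis=2)
    return np.bincount(dist2.argmin(axis=1), minlength=len(k)).astype(float)

def build_codebook(train_hists, train_labels):
    """Average the word frequency vectors of each health condition."""
    h = np.asarray(train_hists, dtype=float)
    y = np.asarray(train_labels)
    classes = np.unique(y)
    return classes, np.stack([h[y == c].mean(axis=0) for c in classes])

def classify(hist, classes, codebook):
    """Return the condition whose codebook vector is nearest in Euclidean distance."""
    return classes[np.linalg.norm(codebook - hist, axis=1).argmin()]


if __name__ == "__main__":
    # Toy two-keyword example with made-up numbers.
    keywords = [[0.0, 0.0], [10.0, 10.0]]
    hist = word_frequency([[0.0, 1.0], [9.0, 9.0], [10.0, 11.0]], keywords)
    classes, codebook = build_codebook(
        [[4, 0], [6, 0], [0, 5], [0, 7]],
        ["normal", "normal", "severe", "severe"],
    )
    print(hist, classify(hist, classes, codebook))
```

Since each condition is summarized by a single averaged vector, classifying a new image requires only as many distance evaluations as there are health conditions.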

Fig. 4 The process of grayscale images representation (Notes: ◆,☆,★ … denote keywords)

2.3.3 Algorithm of selecting optimal number of keywords

The number of keywords \(\kappa\) is a critical hyper-parameter of the VBoW model. To optimize the recognition accuracy and stability of the proposed health condition evaluation method, fourfold cross-validation is used to select the optimal \(\kappa\) during the training step. A value function \(\varepsilon (\kappa )\), which decreases as the recognition rate increases and increases with the standard deviation, is constructed to evaluate the merit of each choice of \(\kappa\) (see Eq. (7)).

$$\varepsilon (\kappa ){ = }1 - \frac{{P_{\kappa } }}{{e^{{S_{\kappa } }} }}$$
(7)

where \(P_{\kappa }\) and \(S_{\kappa }\) denote the overall recognition rate and standard deviation of the fourfold cross-validation results. The smaller the value of \(\varepsilon (\kappa )\) is, the better the recognition accuracy and stability of the model corresponding to this hyper-parameter are.

To improve computational efficiency, the recommended data ranges of \(\kappa\) are first obtained by an equal-interval down-sampling method. Next, the value function for each \(\kappa\) in these ranges is calculated, and the \(\kappa\) with the minimum value function is taken as the optimum \(\kappa_{opt}\). An overview of the algorithm for finding the optimal number of keywords is given in Algorithm 2.
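The final selection step can be sketched as follows. The sketch covers only the evaluation of Eq. (7) and the choice of the minimizing \(\kappa\); the equal-interval down-sampling that produces the recommended ranges is not reproduced. Expressing \(P_{\kappa }\) and \(S_{\kappa }\) as fractions rather than percentages is an assumption, and the numbers in the usage example are made up for illustration.

```python
import math

def value_function(p, s):
    """Eq. (7): epsilon = 1 - P / exp(S), with P and S given as fractions."""
    return 1.0 - p / math.exp(s)

def select_kappa(cv_results):
    """Pick the kappa with the smallest value function.

    cv_results maps kappa -> (overall recognition rate, standard deviation)
    obtained from cross-validation.
    """
    scores = {k: value_function(p, s) for k, (p, s) in cv_results.items()}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Hypothetical cross-validation results, not taken from the paper.
    toy = {10: (0.80, 0.10), 20: (0.95, 0.02), 30: (0.95, 0.05)}
    best, scores = select_kappa(toy)
    print(best, scores)
```

With equal recognition rates, the candidate with the smaller standard deviation yields the smaller \(\varepsilon\), so the criterion rewards stability as well as accuracy.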

Algorithm 2

2.4 Rail track structural condition identification

In this work, the proposed rail track structural health condition evaluation method consists of the following steps. First, the raw vibration signals are subsampled and converted into grayscale images using the method in Sect. 2.1, and the images are divided into a training set and a testing set. Then, the FAST-Unoriented-SIFT algorithm in Sect. 2.2 is used to extract features from the two datasets. After that, the VBoW model is built for the training set and the testing set using the method in Sect. 2.3. Specifically, based on the vibration signal conversion and feature extraction, the optimal number of keywords \(\kappa_{opt}\) is obtained by the algorithm described in Sect. 2.3.3. Finally, each word frequency vector in the testing set is assigned to the health condition whose codebook vector from the training set is nearest in Euclidean distance, as described in Sect. 2.3.2. The proposed rail track structural condition evaluation process is summarized in Fig. 5.

Fig. 5 The proposed rail track structural health condition evaluation method

3 Field test and vibration signal acquisition

The tested metro line has a total length of 80 km and a maximum operating speed of 80 km/h. The tested rail track system is located in a single-hole, single-line circular shield tunnel. Two rails of approximately 60 kg/m are supported by rail fastening systems spaced at 0.6 m intervals on both sides along the line. The clips and anchor bolts of the rail fastening system hold the rails firmly on the track slab. Both the track slab and the subgrade are cast-in-place reinforced concrete laid on the tunnel lining, with a 20 mm expansion joint for every 12.5 m of track slab. The cross-sectional layout of the railway track structure is shown in Fig. 6.

Fig. 6 Cross-sectional layout of rail track structure

In this work, the damage type of rail track structure is the separation crack between concrete subgrade and tunnel lining, which mainly occurs near the expansion joint at the end of the track slab. Therefore, the entire field test includes the measurement of geometrical parameters of the separation crack and the vibration signal acquisition.

3.1 Separation crack measurement

First, the geometric parameters of the separation crack in the same straight track were measured, including the penetration depth along the direction perpendicular to the tunnel lining and crack opening displacement, which were measured by feeler gauge and steel ruler, respectively. With the expansion joint being the center (denoted as location of “0”), the geometrical parameters of the separation crack at a few locations on both sides of the center along the direction of the rail track were measured and recorded. The locations and geometric parameters of cracks are shown in Table 1.

Table 1 Locations and geometric parameters of cracks

Under continuous periodic loading from passing trains, the separation crack between the concrete subgrade and the tunnel lining propagates further along the cross-sectional interface. The crack propagation is affected by the component of the train loading normal to the propagation direction. When the direction of the train loading is close to 90° to the crack propagation direction (i.e., close to penetration), the train loading influences the crack propagation significantly [33, 34]. Once the crack has fully penetrated, the concrete subgrade above the cracked area loses its connection with the tunnel lining, leading to a significant decrease in the bearing capacity of the subgrade and rapid crack propagation along the longitudinal direction of the track. Therefore, the penetration status is crucial in terms of crack propagation. The classification criterion for the rail track structural health conditions is accordingly based on whether penetration is reached, i.e., minor damage corresponds to the pre-penetration status and severe damage to the fully penetrated status, as indicated in Fig. 7b and c, respectively.

Fig. 7 Schematics of rail tracks in different structural health conditions (top view): a normal; b minor damage; c severe damage

3.2 Vibration signal acquisition

After determining the damage types of the rail track structures, vibrations of the rail track with passing trains were measured. Three accelerometers were installed on the rail flange at the location of the expansion joints under three structural health conditions respectively, as shown in Fig. 8.

Fig. 8 Accelerometer for rail track vibration measurement

The vibration signals were acquired by a dynamic signal test and analysis system (model: DH5902N, DongHua Testing Technology Co., Ltd) with 16 data acquisition channels. The accelerometers used in the test are of the piezoelectric type (model: DH1A108E, DongHua Testing Technology Co., Ltd), and their specification is listed in Table 2.

Table 2 Specification of accelerometer

In this field test, the vehicle operating speed was 35–50 km/h. When a train passed through a section of rail track, the vibration of that section was significant for about 7.5 s. Considering that the rail continues to vibrate freely for a short time after the train passes, each vibration signal was truncated to 8.1 s. With a sampling frequency of 1000 Hz, each segmented vibration signal therefore contains 8100 points. The time-domain acceleration signals of rail vibrations under the three health conditions are shown in Fig. 9.

Fig. 9 Vibration signals: a normal; b minor damage; c severe damage

Overall, compared with the normal case, the vibration amplitude in the minor damage case increases slightly when the same train passes, but the change is not obvious. In the severe damage case, however, the signal amplitude increases noticeably relative to the other two cases. This shows that the vibration signals differ under different rail track structural health conditions, but that the vibration signal is not sensitive to early-stage structural damage.

4 Results and discussion

In this work, 100 datasets for each health condition (100 × 3 datasets in total for the three health conditions) are used for identification. Specifically, 80 datasets from each health condition (80 × 3 datasets in total) are used in Sect. 4.1 for training and fourfold cross-validation to ensure the robustness of the algorithm. Then, the remaining 20 datasets from each health condition (20 × 3 datasets in total) are used for testing to verify the accuracy and effectiveness in Sects. 4.2 and 4.3. Finally, in Sect. 4.4, the performance of the proposed method is compared with that of other methods using the same 100 × 3 datasets.
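One possible way to organize this split for a single health condition is sketched below. The random shuffling, the seed and the function name are assumptions of this sketch; the text does not state how the datasets were assigned to the training set, the testing set or the four folds.

```python
import numpy as np

def split_indices(n_per_class=100, n_train=80, n_folds=4, seed=0):
    """Split the dataset indices of one health condition into train, test and folds."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_per_class)        # assumption: random assignment
    train, test = perm[:n_train], perm[n_train:]
    folds = np.array_split(train, n_folds)     # four folds of 20 indices each
    return train, test, folds


if __name__ == "__main__":
    train, test, folds = split_indices()
    for i, val in enumerate(folds):
        # In round i, fold i is held out for validation and the rest are used for fitting.
        fit = np.concatenate([f for j, f in enumerate(folds) if j != i])
        print(f"fold {i}: fit on {len(fit)}, validate on {len(val)}")
```

Applying the same function to each of the three health conditions keeps the classes balanced in every fold, with 60 datasets for fitting and 20 for validation per condition in each round.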

4.1 Optimal parameter selection

The selection of two parameters is important to build the VBoW model, i.e., the threshold value for feature extraction and the optimal number of keywords.

4.1.1 Selection of threshold value

The selection of the threshold value tp (see Eq. (3)) affects not only the number of extracted features but also the clustering in the VBoW model. The number of extracted features and the value function for varying tp are shown in Fig. 10, where \(\kappa\) is 167; this value is selected as the optimum in Sect. 4.1.2.

Fig. 10 Number of extracted features and value function with varying tp

As tp increases, fewer features are extracted and the value function increases accordingly. When tp = 5, the value function is lower and more features are extracted than for the other selected values. Therefore, tp = 5 is selected for further analysis.

4.1.2 Selection of the optimal number of keywords

In this work, the data range of \(\kappa\) is initially set to [10, 300], the minimum of the value function \(\varepsilon_{\min }\) is set to 0.05, the minimal difference between two adjacent value functions \(\varepsilon_{adj}\) is set to 0.001 and the interval M is set to 30 in consideration of the computational cost. To ensure that the overall trend of the value function with the number of keywords is less affected by local fluctuations, average filtering is applied to the value function results. Specifically, the value function for each number of keywords is the average of the value functions of five adjacent numbers of keywords. The calculated result is shown in Fig. 11. Overall, \(\varepsilon (\kappa )\) decreases as \(\kappa\) increases. When \(\kappa\) is greater than 150, the value function tends to level off. Moreover, to obtain the recommended data ranges, the difference between two adjacent value functions is further analyzed. According to the algorithm described in Sect. 2.3.3, the recommended data ranges of \(\kappa\) are [160, 190] and [190, 220].

Fig. 11 Value function and difference of two adjacent value functions with M = 30

Comparing all the value functions in the above recommended data ranges (Fig. 12) shows that the value function is lower than \(\varepsilon_{\min }\) when \(\kappa\) is taken as 167, 190 and 210. In consideration of the computational cost, \(\kappa_{opt}\) = 167 is selected as the optimum.

Fig. 12 Comparison of the value function in the recommended data ranges with M = 30

4.1.3 Training results validation

To further illustrate the recognition accuracy and stability of the proposed health condition evaluation method under the \(\kappa_{opt}\) selected in Sect. 4.1.2, the overall recognition rate (\(P_{\kappa }\) in Eq. (7)) and standard deviation (\(S_{\kappa }\) in Eq. (7)) for \(\kappa\) varied over [3, 300] are shown in Fig. 13. Although such an exhaustive sweep over \(\kappa\) is not part of our method, it is used here to demonstrate the algorithm performance and to show the validity of the proposed method for selecting \(\kappa_{opt}\).

Fig. 13 The overall recognition rate and standard deviation of training set with varying \(\kappa\)

In general, as \(\kappa\) increases, the \(P_{\kappa }\) of the algorithm increases and \(S_{\kappa }\) decreases. As can be seen in Fig. 13, when \(\kappa\) is greater than 150, the algorithm maintains a high recognition rate and a low standard deviation, indicating that the algorithm is suitable for the recognition of rail track structural health conditions and has high stability. The overall recognition rate of fourfold cross-validation on the training set reaches 97.1% when \(\kappa\) is 167, as selected in Sect. 4.1.2. Although the overall recognition rate reaches 97.9% when \(\kappa\) is taken as 298, the improvement is small and the computational cost is significantly higher. Therefore, the \(\kappa_{opt}\) selected by the proposed method ensures the recognition accuracy and stability of the proposed health condition evaluation method.

4.2 Testing results

To verify the applicability of the recognition performance of the method used in this work, the remaining 20 × 3 signals are tested and analyzed here. The test result is shown in Fig. 14. Specifically, the horizontal axis represents the 60 signals in the testing set while the vertical axis represents the actual three rail track structural health conditions. Moreover, the solid points represent the real output of each sample, and the circles represent the target output of the test samples. The test results show that 58 of the 60 testing samples are correctly identified, with an overall recognition rate of 96.67%.

Fig. 14 The recognition result of testing set

4.3 Real-time analysis

The operating efficiency of this recognition method directly affects its practical engineering value. To illustrate the effectiveness of the proposed method, its real-time performance on the testing set is analyzed. The computational platform consists of an i7-10750H CPU and an NVIDIA GeForce RTX 2060 graphics processing unit. The whole recognition process based on the VBoW model is implemented in MATLAB R2020 under Windows 10. The recognition process has three steps, i.e., signal conversion, image feature extraction and health condition identification. The average time for completing each step is shown in Table 3.

Table 3 Real-time analysis of testing set

As shown in Table 3, for all three health conditions of the rail track structures in this work, the average time for the whole recognition process using the VBoW model is within 0.15 s. As noted in Sect. 3.2, each segmented vibration signal sample is 8.1 s long. Since the recognition time is far shorter than the signal duration, the VBoW model is very efficient and is able to meet the real-time requirement.

4.4 Performance comparison by different methods

To further illustrate the superiority of the proposed health condition evaluation method, it is compared with the representative traditional learning-based classification algorithms and the deep learning models. Specifically, the traditional learning-based classification algorithms are selected as: Support Vector Machine (SVM) [35, 36] and K-Nearest Neighbor algorithm (KNN) [37, 38], and the deep learning models are selected as: AlexNet [39], ResNet-18 [40], and DarkNet-53 [41]. To be more detailed, Grid Searching (GS) technique is adopted to optimize the parameters (e.g., penalty coefficient and the kernel function parameter) of the SVM model, and the number of nearest neighbors in this work is set to be 1 for KNN, which is optimal.

All six methods, i.e., the five comparison methods above and the proposed method, are run in the same MATLAB environment, with default values accepted for all other parameters. In addition, all methods are subjected to fourfold cross-validation on the 80 × 3 training signals and performance validation on the 20 × 3 testing signals. The calculation results are shown in Fig. 15. The overall recognition rate and standard deviation are taken to reflect the reliability and stability, respectively.

Fig. 15 Comparison of recognition results with different algorithms and models

As shown in Fig. 15, the overall recognition rates of the KNN and SVM algorithms for wheel–rail vibration signals are only about 60% and 77%, respectively. Although the overall recognition rates of the AlexNet, ResNet-18 and DarkNet-53 models exceed 80%, this is still insufficient to meet the requirements of real-time monitoring in real-life applications. Our method achieves a higher recognition rate of 97.1% than the other five models and algorithms. Furthermore, our method obtains a lower standard deviation, demonstrating more stable recognition behavior. In addition, the overall recognition rate of 96.7% on the testing set further confirms the better performance of the proposed method.

5 Conclusions and future work

In this work, a novel computer vision-aided method for the nondestructive evaluation of the health condition of rail track structures based on vibration signals has been proposed. Specifically, the method does not require tedious noise reduction processing or redundant feature elimination. By directly converting the raw vibration signals into grayscale images, we have adopted multi-dimensional image feature vectors instead of the one-dimensional feature arrays of traditional signal processing methods. To quickly extract numerous features from the vibration signals, the FAST-Unoriented-SIFT algorithm has been utilized. Meanwhile, the VBoW model with the optimal number of keywords has been proposed to describe and identify the grayscale image features. The overall recognition rate of the proposed model on the testing set is 96.7% (i.e., 58 out of 60). In addition, a comparison with traditional learning-based classification algorithms and representative deep learning models shows that the proposed method is more suitable for the effective identification of strongly time-varying and random vibration signals and has promising prospects for practical structural health monitoring applications.

Although the proposed method achieves satisfactory results in this work for damage identification under different rail track structural health conditions, popular algorithms based on the Visual Bag-of-Words model are, owing to their learning capacity, restricted to relatively short signal segments. However, high-dimensional signals and features can better capture the valuable information of a dynamic system in a more complex environment. In addition, in rail track structural health monitoring, obtaining more spatial information requires numerous sensors to be fixed along the track line for real-time monitoring, which not only results in data overload but also increases operation and maintenance costs. In contrast, placing mobile sensor networks on vehicle components (e.g., the axlebox) for health monitoring of rail track structures has greater potential for low-cost monitoring applications. However, the rail track vibration signal data collected by mobile sensor networks is spatio-temporal and is subject to vehicle interference noise. Recent studies [42, 43] have proposed methods in which sparse vibration data from mobile sensor networks were used to successfully identify features of bridge structures. Therefore, future work can focus on investigating a more robust method for extracting the high-dimensional feature information of the wheel–rail dynamic system, built around the VBoW model and sparse representation using vehicle mobile sensor networks.