Vibration-based and computer vision-aided nondestructive health condition evaluation of rail track structures

In railway engineering, monitoring the health condition of rail track structures is crucial to prevent abnormal vibration of the wheel–rail system. To address the low efficiency of traditional nondestructive testing methods, this work investigates the feasibility of a computer vision-aided health condition monitoring approach for track structures based on vibration signals. The proposed method eliminates tedious and complicated data pre-processing, including signal mapping and noise reduction, and instead achieves a robust signal description by using numerous redundant features. First, the method converts the raw wheel–rail vibration signals directly into two-dimensional grayscale images, followed by image feature extraction using the FAST-Unoriented-SIFT algorithm. Subsequently, a Visual Bag-of-Words (VBoW) model is established based on the image features, where the optimal parameters are selected by fourfold cross-validation considering both recognition accuracy and stability. Finally, the Euclidean distance between the word frequency vectors of the testing set and the codebook vectors of the training set is compared to recognize the health condition of the track structures. For the three health conditions of track structures analyzed in this paper, the overall recognition rate reaches 96.7%. The results demonstrate that the proposed method achieves higher recognition accuracy and lower bias for strongly time-varying and random vibration signals, and that it has promising application prospects in early-stage structural defect detection.


Introduction
With the rapid development of urban rail transportation, ballastless track has been widely used in many metro lines because of its advantages of low periodic maintenance and greater vehicle stability [1]. For metro shield tunnels, the track structures consist of rail, fasteners, track slab, subgrade and tunnel lining. The track slab and subgrade are usually constructed as monolithic cast-in-place concrete structures on the tunnel lining. However, under continuous cyclic train loads, cracks often occur at the interface between the subgrade and the tunnel lining near the expansion joints of the track slab [2]. Over time, this type of crack will further propagate along the cross-sectional interface, ultimately causing damage to vehicle components or other track structures [3]. By the time such damage can be detected visually, the track structures are already unable to deliver normal service performance, which endangers the operation of trains. Therefore, health condition monitoring and damage detection of rail track structures, especially early-stage damage identification, are critically important [4][5][6].
As the vibration properties (e.g., the natural frequency, amplitude, and damping ratio) of a structural system vary with its mechanical parameters and boundary conditions, many researchers have attempted to use the variation of vibration properties to detect the structural health condition [7][8][9][10]. The commonly used methods for vibration signal analysis included empirical mode decomposition (EMD) [11,12], short-time Fourier transform (STFT) [13,14], and wavelet transform (WT) [15,16]. In general, the traditional vibration signal-based health condition evaluation in the frequency domain required pre-processing of the raw signals. Hence, complex mapping and transformations were performed to filter out noise and redundant information and to highlight the information related to the damage. Finally, eigenvalues or feature vectors that best characterize the damage were extracted.
Shaohua Wang and Hao Zheng have contributed equally to this work and share first authorship.
However, noise is challenging to deal with. In practical engineering, wheel-rail impact vibration has multiple excitation sources with time-varying and random characteristics, so the vibration signals contain substantial environmental noise. This strong environmental noise tends to mask the subtle variations caused by damage in the vibration signals [7,17]. For such complex signals, traditional signal processing methods suffer from low efficiency and poor generality. To address these issues, researchers developed signal visualization methods that convert the vibration signals into two-dimensional grayscale images for analysis. The image features could then be extracted directly for damage detection and recognition without noise reduction [18][19][20]. These analysis methods demonstrated better robustness in strong-noise environments [21,22]. Specifically, Do et al. [23] used the scale invariant feature transform (SIFT) algorithm to extract local feature vectors from 2D images based on vibration signals for the detection and diagnosis of asynchronous motor faults. Zheng et al. [24] proposed a novel FAST-Unoriented-SIFT algorithm for extracting planetary gear fault feature values, which was more efficient and extracted more features than the SIFT algorithm. Meanwhile, bag-based representations have been widely used to compute the similarity between digital objects by characterizing the frequency of occurrence of object features [25]. Among them, Visual Bag-of-Words (VBoW) models have been effectively used for image feature clustering and classification tasks [26]. Qi et al. [27] used a VBoW model to extract surface features of soil to efficiently characterize textural information. Yang et al. [28] extracted the vibration signal features of rotating machinery based on the VBoW model for detection and achieved good results. Zheng et al. [29] found that the VBoW model had good recognition efficiency and accuracy when trained on small datasets of planetary gear fault features.
Inspired by the aforementioned research, this work proposes a vibration-based and computer vision-aided health condition evaluation method for rail track structures. Unlike traditional methods, this method avoids the noise reduction process, directly extracts numerous redundant features, and uses the redundancy of features to achieve a more robust signal description. The rest of the paper is organized as follows. Section 2 presents the methodology, which consists of three parts: grayscale image processing of rail vibration signals, image feature extraction, and the establishment of the VBoW model with optimal parameters. Section 3 describes the vibration signal acquisition during the field test. Section 4 analyzes and discusses the experimental results. Specifically, the optimal parameters are first obtained from the training dataset by fourfold cross-validation. Then, the reliability and real-time performance of the proposed method are verified using the testing dataset. Finally, the advantages of the proposed method are highlighted by a performance comparison. Concluding remarks are given in Sect. 5. By employing the vibration signal visualization, encoding and classification methods, the feature information of different track structural health conditions can be effectively identified.

Vibration signal visualization
This section introduces the method to convert the raw vibration signals to the pixels of the grayscale image [30]. Firstly, the amplitude of the one-dimensional time-domain vibration signal s is normalized to the range of [1,255], and the normalized signal s_norm is obtained according to Eq. (1).
where s = [s_1, s_2, …, s_n] is the raw vibration signal, n represents the length of the signal, and s_n is the magnitude of the n-th value of the signal.
Then, each value of the normalized signal s_norm is sequentially mapped to one pixel of the two-dimensional grayscale image G as its grayscale value. The grayscale value of each pixel of the grayscale image G is assigned according to Eq. (2).
where G(x, y) denotes the grayscale value of the pixel in the x-th row and y-th column of the grayscale image G, and s_norm(i) denotes the i-th value of s_norm. The obtained grayscale image G has M × N pixel points (i.e., M and N represent the numbers of rows and columns of the grayscale image, respectively). The conversion method is shown in Fig. 1.
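The conversion described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that the normalization is a linear min-max scaling to [1,255] with rounding, and that the image is filled row by row. The function names and the small demo signal are hypothetical.

```python
def normalize_signal(signal, low=1, high=255):
    """Min-max scale a 1-D signal to integer grey levels in [low, high].

    Assumption: linear scaling with rounding to the nearest integer.
    """
    s_min, s_max = min(signal), max(signal)
    if s_max == s_min:
        # A constant signal carries no contrast: use the lowest grey level.
        return [low for _ in signal]
    scale = (high - low) / (s_max - s_min)
    return [int(round(low + (value - s_min) * scale)) for value in signal]


def signal_to_image(signal, rows, cols):
    """Fill a rows x cols grayscale image row by row (assumed ordering)."""
    if len(signal) < rows * cols:
        raise ValueError("signal is shorter than rows * cols")
    s_norm = normalize_signal(signal[: rows * cols])
    return [s_norm[r * cols:(r + 1) * cols] for r in range(rows)]


# Hypothetical 12-sample signal mapped to a 3 x 4 image.
demo = [0.0, 0.5, 1.0, -1.0, 0.25, -0.25, 0.75, -0.75, 0.1, -0.1, 0.9, -0.9]
image = signal_to_image(demo, rows=3, cols=4)
```

With this layout, a signal of M × N samples fills the image exactly; the image size to use for a given segment length is a design choice that the sketch leaves to the caller.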

Image feature extraction
The image feature extraction method utilized in this work combines the feature point detection of the Features from Accelerated Segment Test (FAST) algorithm [31] with the feature description capability of the Unoriented Scale Invariant Feature Transform (SIFT) algorithm [32]. The FAST-Unoriented-SIFT algorithm has shown good performance and robustness in noisy environments [24].

FAST feature point detection
Pixel points whose grayscale value differs greatly from that of the surrounding pixels are taken as feature points [31]. The status S_p→q of the center pixel p with respect to pixel q is calculated according to Eq. (3), where p is a candidate feature point and q is any pixel on a circle of radius r around p, as shown in Fig. 2.
where G is the grayscale value of a pixel point, n_c is the count threshold used for corner detection on this circle, and t_p is the threshold value. The status d_q or b_q indicates that the pixel on this circle is darker or brighter than the center pixel, respectively, and s_q indicates that the pixel is similar to the center pixel. In other words, if the number of statuses d_q and b_q in S_p→q is greater than n_c, the center pixel p of this circle is considered a feature point.
In this work, r = 4 and n_c = 12 were taken according to Ref. [31]. The threshold value t_p determines the number of extracted feature points, and the related analysis is shown in Sect. 4.1.1.
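As an illustration of the segment test, the sketch below classifies each circle pixel as darker, similar or brighter and then applies the contiguous-arc criterion of the standard FAST detector (at least n_c consecutive pixels all darker or all brighter). This is an assumption: it follows the usual FAST formulation rather than a confirmed transcription of Eq. (3), and the counting rule in the text may differ in detail. The circle intensities are passed in already sampled, so the sketch is independent of the radius.

```python
def pixel_status(center, neighbor, threshold):
    """Return 'd' (darker), 'b' (brighter) or 's' (similar) for one circle pixel."""
    if neighbor <= center - threshold:
        return "d"
    if neighbor >= center + threshold:
        return "b"
    return "s"


def longest_circular_run(statuses, label):
    """Length of the longest contiguous run of `label` on a closed circle."""
    n = len(statuses)
    if all(status == label for status in statuses):
        return n
    best = run = 0
    # Walk the circle twice so that runs wrapping past the start are counted.
    for status in statuses + statuses:
        run = run + 1 if status == label else 0
        best = max(best, run)
    return min(best, n)


def is_feature_point(center, circle, threshold, n_c):
    """Standard FAST test: n_c contiguous pixels all darker or all brighter."""
    statuses = [pixel_status(center, q, threshold) for q in circle]
    return (longest_circular_run(statuses, "d") >= n_c
            or longest_circular_run(statuses, "b") >= n_c)


# Hypothetical 16-pixel circle: 12 bright pixels in a row, then 4 similar ones.
corner = is_feature_point(100, [120] * 12 + [100] * 4, threshold=5, n_c=12)
```

A smaller threshold makes more circle pixels count as darker or brighter, which is why lowering t_p yields more feature points.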

Unoriented-SIFT feature description
After detecting the feature points, a 16 × 16 window around each feature point p is first taken, and then the gradient modulus and direction of each pixel point within this window are calculated according to Eqs. (4) and (5), where m(x, y) and θ(x, y) denote the modulus and direction of the gradient, respectively.
After that, the window is divided into 16 cells of 4 × 4 pixels, and the gradient modulus in each cell is accumulated over 8 directions (45 degrees per direction). An 8-dimensional description vector is then obtained for each cell, and a 128-dimensional (16 × 8) description vector d is finally obtained for the whole window, which is the Unoriented-SIFT description vector of feature point p. The process of Unoriented-SIFT feature description is shown in Fig. 3, where the direction of the arrow represents the gradient direction of the pixel, and the length of the arrow represents the modulus of the gradient.
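The descriptor construction can be sketched as follows. The sketch assumes central-difference gradients (the usual SIFT choice) and omits the Gaussian weighting and vector normalization of the full SIFT descriptor, so it illustrates the 16 cells × 8 directions layout rather than reproducing the authors' exact computation. The function name and the ramp test image are hypothetical.

```python
import math


def unoriented_descriptor(image, row, col):
    """128-D histogram for the window rows row-8..row+7, cols col-8..col+7.

    Assumptions: central-difference gradients, no Gaussian weighting,
    no rotation to a dominant orientation, no normalization.
    """
    height, width = len(image), len(image[0])
    # The gradient needs one extra pixel of margin around the window.
    if not (9 <= row <= height - 9 and 9 <= col <= width - 9):
        raise ValueError("feature point too close to the image border")
    descriptor = [0.0] * 128
    for dy in range(16):
        for dx in range(16):
            y, x = row - 8 + dy, col - 8 + dx
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            modulus = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % (2 * math.pi)
            direction = int(angle / (math.pi / 4)) % 8   # 8 bins of 45 degrees
            cell = (dy // 4) * 4 + (dx // 4)             # 16 cells of 4 x 4 pixels
            descriptor[cell * 8 + direction] += modulus
    return descriptor


# Hypothetical 32 x 32 image whose grey level grows from left to right.
ramp = [[x for x in range(32)] for _ in range(32)]
d = unoriented_descriptor(ramp, 16, 16)
```

For the ramp image every gradient points in the same direction, so all the weight of each cell falls into a single bin, which makes the cell and direction indexing easy to check by hand.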

VBoW modeling
In the VBoW model, each image feature (i.e., the 128-dimensional description vector d mentioned in Sect. 2.2.2) is quantized as one word, and each grayscale image is viewed as a bag full of words. Then, similar image features are clustered into one class by the k-means algorithm, and the vector of each cluster center is quantized as one keyword. Therefore, the VBoW modeling process includes keyword acquisition and grayscale image representation.

Keywords acquisition
Assuming that N image features D_N = {d_1, d_2, …, d_N} are extracted from one grayscale image, the k-means algorithm partitions D_N into clusters {C_i | i = 1, 2, …}, one cluster per keyword, by minimizing the squared error e according to Eq. (6).
The smaller the value of e is, the better the clustering result is. The calculation process is shown in Algorithm 1.
Fig. 3 The process of Unoriented-SIFT feature description
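A compact sketch of the clustering step is given below. It implements plain Lloyd-style k-means and reports the squared error as the sum of squared distances between each feature and its cluster center, which is the standard k-means objective and is assumed here to correspond to e. It is not a transcription of Algorithm 1; in particular, the deterministic initialization from the first k features is a simplification chosen for reproducibility.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(features, centers):
    """Index of the nearest center for every feature."""
    return [min(range(len(centers)), key=lambda i: squared_distance(f, centers[i]))
            for f in features]


def kmeans(features, k, iterations=50):
    """Lloyd's algorithm; returns (centers, labels, squared_error)."""
    centers = [list(f) for f in features[:k]]  # assumed simple initialization
    for _ in range(iterations):
        labels = assign(features, centers)
        new_centers = []
        for i in range(k):
            members = [f for f, lab in zip(features, labels) if lab == i]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    labels = assign(features, centers)
    error = sum(squared_distance(f, centers[lab]) for f, lab in zip(features, labels))
    return centers, labels, error


# Hypothetical 2-D features forming two well-separated groups.
points = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]]
keywords, labels, e = kmeans(points, k=2)
```

The returned centers play the role of keywords; in practice the features would be the 128-dimensional description vectors and k the number of keywords.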

Grayscale image representation
After the keyword acquisition, the frequency of keywords in each grayscale image is counted to obtain the word frequency vector. In the training set, the codebook vector is obtained by averaging the word frequency vectors represented by all grayscale images related to the same health condition, i.e., each codebook vector describes one health condition of the track structure. By calculating the Euclidean distance between the word frequency vector extracted from each grayscale image in the testing set and the codebook vectors of three health conditions in the training set, each word frequency vector is categorized to one health condition according to the minimum Euclidean distance to the codebook vector. The process is shown in Fig. 4.
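The representation and matching steps can be sketched as follows: each image feature votes for its nearest keyword, the votes form the word frequency vector, the codebook vector of a health condition is the mean of its training word frequency vectors, and a test vector takes the label of the nearest codebook vector. The function names, the condition labels and the toy numbers are hypothetical, and raw counts are used without normalization, which is an assumption.

```python
import math


def word_frequency(features, keywords):
    """Count how many image features are nearest to each keyword."""
    counts = [0] * len(keywords)
    for f in features:
        nearest = min(range(len(keywords)),
                      key=lambda i: math.dist(f, keywords[i]))
        counts[nearest] += 1
    return counts


def build_codebook(vectors_by_condition):
    """Average the word frequency vectors of each health condition."""
    return {condition: [sum(col) / len(vectors) for col in zip(*vectors)]
            for condition, vectors in vectors_by_condition.items()}


def classify(vector, codebook):
    """Label of the codebook vector at minimum Euclidean distance."""
    return min(codebook, key=lambda condition: math.dist(vector, codebook[condition]))


# Toy example with two keywords and two conditions (hypothetical values).
keywords = [[0, 0], [10, 10]]
test_vector = word_frequency([[1, 0], [9, 9], [0, 1]], keywords)
codebook = build_codebook({
    "normal": [[2, 0], [4, 0]],
    "severe damage": [[0, 2], [0, 4]],
})
predicted = classify(test_vector, codebook)
```

Because classification reduces to one distance per health condition, the identification step is cheap compared with feature extraction.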

Algorithm of selecting optimal number of keywords
The number of keywords is a critical hyper-parameter of the VBoW model. To optimize the recognition accuracy and stability of the proposed health condition evaluation method, fourfold cross-validation is used to select the optimal number of keywords during the training step. A value function of the number of keywords, which decreases as the recognition rate rises and increases with the standard deviation, is constructed to evaluate the merit of each candidate number of keywords (see Eq. (7)).
where P and S denote the overall recognition rate and the standard deviation of the fourfold cross-validation results, respectively.
The smaller the value function is, the better the recognition accuracy and stability of the model corresponding to this hyper-parameter are.
To improve the computational efficiency, the recommended data ranges of the number of keywords are first obtained by an equal-interval down-sampling method. Next, the value function is calculated for each number of keywords in these ranges, and the number of keywords with the minimum value function is taken as the optimum. The overview of the algorithm to seek the optimal number of keywords is shown in Algorithm 2.
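The two quantities that enter the value function, P and S, come from fourfold cross-validation, which can be sketched as below. Several points are assumptions: the folds are interleaved, P is taken as the mean of the four fold recognition rates, S as their population standard deviation, and `train_and_score` is a hypothetical callback that trains on one part and returns the recognition rate on the other. The value function of Eq. (7) itself is not implemented here.

```python
import statistics


def fourfold_indices(n_samples, n_folds=4):
    """Split indices 0..n_samples-1 into n_folds interleaved validation folds."""
    return [list(range(fold, n_samples, n_folds)) for fold in range(n_folds)]


def cross_validate(samples, labels, train_and_score, n_folds=4):
    """Return (P, S): mean and population std of the per-fold recognition rates.

    `train_and_score(train, validation)` is a user-supplied (hypothetical)
    function taking two lists of (sample, label) pairs and returning a rate.
    """
    rates = []
    for validation_idx in fourfold_indices(len(samples), n_folds):
        held_out = set(validation_idx)
        train = [(samples[i], labels[i])
                 for i in range(len(samples)) if i not in held_out]
        validation = [(samples[i], labels[i]) for i in validation_idx]
        rates.append(train_and_score(train, validation))
    return statistics.mean(rates), statistics.pstdev(rates)
```

Repeating this for each candidate number of keywords gives one (P, S) pair per candidate, from which the value function can then be evaluated and compared.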

Rail track structural condition identification
In this work, the proposed method for identifying the rail track structural health condition consists of the following steps. Firstly, the raw vibration signals are subsampled and converted into grayscale images using the method in Sect. 2.1, and the images are divided into two parts, i.e., the training set and the testing set. Then, the FAST-Unoriented-SIFT algorithm in Sect. 2.2 is used to extract features from the two datasets. After that, the VBoW model is built for the training set and the testing set, respectively, using the method in Sect. 2.3. Specifically, based on vibration signal conversion and feature extraction, the optimal number of keywords is obtained by the algorithm described in Sect. 2.3.3. By calculating the Euclidean distance between the word frequency vector extracted from each grayscale image in the testing set and the codebook vectors of the three health conditions in the training set, each word frequency vector is assigned to one health condition according to the minimum Euclidean distance to the codebook vector. The proposed rail track structural condition evaluation process is summarized in Fig. 5.

Field test and vibration signal acquisition
The tested metro line has a total length of 80 km and a maximum operating speed of 80 km/h. The tested rail track system is located in a single-hole, single-line circular shield tunnel. Two rails of approximately 60 kg/m are supported by rail fastening systems spaced at 0.6 m intervals on both sides along the line. Clips and anchor bolts of the rail fastening system hold the rails firmly on the track slab. Both the track slab and the subgrade are cast-in-place reinforced concrete laid on the tunnel lining, with 20 mm expansion joints every 12.5 m of track slab. The cross-sectional layout of the railway track structures is shown in Fig. 6.
In this work, the damage type of rail track structure is the separation crack between concrete subgrade and tunnel lining, which mainly occurs near the expansion joint at the end of the track slab. Therefore, the entire field test includes the measurement of geometrical parameters of the separation crack and the vibration signal acquisition.

Separation crack measurement
First, the geometric parameters of the separation crack in the same straight track were measured, including the penetration depth along the direction perpendicular to the tunnel lining and crack opening displacement, which were measured by feeler gauge and steel ruler, respectively. With the expansion joint being the center (denoted as location of "0"), the geometrical parameters of the separation crack at a few locations on both sides of the center along the direction of the rail track were measured and recorded. The locations and geometric parameters of cracks are shown in Table 1.
With the continuous periodic loading from passing trains, the separation crack between the concrete subgrade and the tunnel lining will further propagate along the cross-sectional interface. The crack propagation is affected by the component of the train loading normal to the propagation direction. When the direction of the train loading is close to 90° to the crack propagation direction (i.e., close to penetration), the train loading influences the crack propagation significantly [33,34]. Once the crack has fully penetrated, the concrete subgrade above the cracked area loses its connection with the tunnel lining, leading to a significant decrease in the bearing capacity of the subgrade and rapid crack propagation along the longitudinal direction of the track. Therefore, the penetration status is crucial in terms of crack propagation. The classification criterion for the rail track structural health conditions is therefore based on whether penetration is reached, i.e., minor damage corresponds to the pre-penetration status and severe damage to the fully penetrated status, as indicated in Fig. 7b and c, respectively.
Fig. 9 Vibration signals: a normal; b minor damage; c severe damage

Vibration signal acquisition
After determining the damage types of the rail track structures, vibrations of the rail track with passing trains were measured. Three accelerometers were installed on the rail flange at the location of the expansion joints under three structural health conditions respectively, as shown in Fig. 8.
The vibration signals are acquired by the dynamic signal test and analysis system (model: DH5902N, DongHua Testing Technology Co., Ltd), with 16 data acquisition channels. The accelerometers used in the test are piezoelectric (see Table 2).
In this field test, the vehicle operating speed was 35–50 km/h. When a train passed through a section of rail track, the vibrational displacement of that section was significant for about 7.5 s. The sampling frequency of the vibration signals was 1000 Hz. Considering that the rail vibrates freely for a short time after the train passes, each vibration signal was truncated to 8.1 s, i.e., 8100 points per segmented vibration signal. The time-domain acceleration signals of rail vibrations under the three health conditions are shown in Fig. 9.
Overall, compared with the normal case, the vibration amplitude in the minor damage case increases slightly when the same train passes, but the change is not obvious. In the severe damage case, however, the signal amplitude increases noticeably relative to the other two cases. This shows that the vibration signals differ under different rail track structural health conditions, but that the vibration signal itself is not sensitive to early-stage structural damage.

Results and discussion
In this work, 100 datasets corresponding to each health condition (100 × 3 datasets in total for the three health conditions) are utilized for identification. Specifically, 80 datasets from each health condition (80 × 3 datasets in total) are utilized for training and fourfold cross-validation to ensure the robustness of the algorithm in Sect. 4.1. Then, the remaining 20 datasets from each health condition (20 × 3 datasets in total) are utilized for testing to verify the accuracy and effectiveness in Sects. 4.2 and 4.3. Finally, in Sect. 4.4, the performance of the proposed method is compared with other methods using the same 100 × 3 datasets.

Optimal parameter selection
The selection of two parameters is important to build the VBoW model, i.e., the threshold value for feature extraction and the optimal number of keywords.

Selection of threshold value
The selection of the threshold value t_p (see Eq. (3)) affects not only the number of extracted features but also the clustering in the VBoW model. The number of extracted features and the value function for varying t_p are shown in Fig. 10, where the number of keywords is 167; this parameter is selected optimally as shown in Sect. 4.1.2.
It is found that as t_p increases, fewer features are extracted, and the value function increases accordingly. When t_p = 5, the value function is lower and more features can be extracted than for the other selected values. Therefore, t_p = 5 is selected for further analysis.

Selection of the optimal number of keywords
In this work, the data range of the number of keywords is initially set to [10,300], the threshold for the minimum of the value function is set to 0.05, the minimal difference between two adjacent value functions is set to 0.001, and the interval M is set to 30 in view of the computational cost. To ensure that the overall trend of the value function with the number of keywords is less affected by local fluctuations, average filtering is applied to the value function results. Specifically, the value function corresponding to each number of keywords is the average of the value functions of five adjacent numbers of keywords.
The calculated result is shown in Fig. 11. Overall, the value function decreases as the number of keywords increases. When the number of keywords is greater than 150, the value function tends to level off. Moreover, to obtain the recommended data ranges of the number of keywords, the difference between two adjacent value functions is further analyzed. According to the algorithm described in Sect. 2.3.3, the recommended data ranges of the number of keywords are [160,190] and [190,220]. By comparing all the value functions in these recommended data ranges (Fig. 12), the value function is found to be lower than the threshold when the number of keywords is taken as 167, 190 and 210. Considering the computational cost, the smallest of these, 167, is selected as the optimal number of keywords.
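The average filtering over five adjacent values and the final selection of the minimum can be sketched as follows. The sketch assumes a centered window that shrinks at the two ends of the range, which the text does not specify, and the numbers in the example are hypothetical rather than measured results.

```python
def smooth_five(values):
    """Centered moving average over up to five adjacent entries.

    Assumption: at the ends of the list the window is truncated to the
    entries that exist.
    """
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - 2): i + 3]
        smoothed.append(sum(window) / len(window))
    return smoothed


def best_keyword_count(counts, values):
    """Return the candidate count with the smallest smoothed value function."""
    smoothed = smooth_five(values)
    best = min(range(len(counts)), key=lambda i: smoothed[i])
    return counts[best], smoothed[best]


# Hypothetical value-function samples for nine candidate keyword counts.
candidate_counts = [10, 20, 30, 40, 50, 60, 70, 80, 90]
raw_values = [10, 8, 6, 4, 2, 4, 6, 8, 10]
chosen, smoothed_value = best_keyword_count(candidate_counts, raw_values)
```

Smoothing before taking the minimum prevents a single favourable fluctuation from deciding the choice; when several candidates are nearly tied, preferring the smallest one keeps the computational cost down, as done in the selection above.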

Training results validation
To further illustrate the recognition accuracy and stability of the proposed health condition evaluation method with the optimal number of keywords selected in Sect. 4.1.2, the overall recognition rate (P in Eq. (7)) and the standard deviation (S in Eq. (7)) for numbers of keywords swept over [3,300] are shown in Fig. 13. Although such a full sweep is not part of the proposed method, it is used here to demonstrate the algorithm performance and to show the validity of the proposed method for selecting the optimal number of keywords.
In general, as the number of keywords increases, P increases and S decreases. As can be seen in Fig. 13, when the number of keywords is greater than 150, the algorithm maintains a high recognition rate and a low standard deviation, indicating that the algorithm is suitable for the recognition of rail track structural health conditions and has high stability. The overall recognition rate of fourfold cross-validation on the training set reaches 97.1% when the number of keywords is 167, as selected in Sect. 4.1.2. Although the overall recognition rate can reach 97.9% when the number of keywords is taken as 298, the improvement is small and the computational cost increases significantly. Therefore, the number of keywords selected by the proposed method ensures the recognition accuracy and stability of the proposed health condition evaluation method.

Testing results
To verify the recognition performance of the proposed method, the remaining 20 × 3 signals are tested and analyzed here. The test result is shown in Fig. 14, where the horizontal axis represents the 60 signals in the testing set and the vertical axis represents the three actual rail track structural health conditions. The solid points represent the real output of each sample, and the circles represent the target output of the test samples. The test results show that 58 of the 60 testing samples are correctly identified, with an overall recognition rate of 96.67%.
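The overall recognition rate is simply the share of test samples whose predicted condition matches the target condition, which can be written as the short sketch below. The function name and the six-sample example are hypothetical and do not reproduce the actual test outputs.

```python
def recognition_rate(predicted, target):
    """Fraction of samples whose predicted label equals the target label."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    correct = sum(p == t for p, t in zip(predicted, target))
    return correct / len(target)


# Hypothetical labels: 0 = normal, 1 = minor damage, 2 = severe damage.
target = [0, 0, 1, 1, 2, 2]
predicted = [0, 0, 1, 2, 2, 2]
rate = recognition_rate(predicted, target)
```

Applied to 60 test samples with 58 matches, the same ratio gives 58/60, i.e., about 96.67%.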

Real-time analysis
The operating efficiency of this recognition method directly affects its practical engineering value. To illustrate the efficiency of the proposed method, the real-time performance on the test set is analyzed. The computational platform consists of an i7-10750H CPU and an NVIDIA GeForce RTX 2060 graphics processing unit. The whole recognition process based on the VBoW model is implemented in MATLAB R2020 under Windows 10. The recognition process has three steps, i.e., signal conversion, image feature extraction and health condition identification. The average time for completing each step is shown in Table 3.
As shown in Table 3, for all the three health conditions of the rail track structures in this work, the average time for the whole recognition process using VBoW model is relatively small, within 0.15 s. In Sect. 3.2, each segmented vibration signal sample is 8.1 s. Therefore, VBoW model is very efficient and has the ability to meet the real-time requirement.

Performance comparison by different methods
To further illustrate the superiority of the proposed health condition evaluation method, it is compared with representative traditional learning-based classification algorithms and deep learning models. Specifically, the traditional learning-based classification algorithms are the Support Vector Machine (SVM) [35,36] and the K-Nearest Neighbor algorithm (KNN) [37,38], and the deep learning models are AlexNet [39], ResNet-18 [40], and DarkNet-53 [41]. In more detail, the Grid Searching (GS) technique is adopted to optimize the parameters (e.g., the penalty coefficient and the kernel function parameter) of the SVM model, and the number of nearest neighbors for KNN is set to 1, which is optimal. All six methods (the five comparison methods and the proposed method) are run in the same MATLAB environment, with all other parameters left at their default values. In addition, all methods are subjected to fourfold cross-validation on the 80 × 3 training signals and performance validation on the 20 × 3 testing signals. The calculation results are shown in Fig. 15. The overall recognition rate and the standard deviation are taken to reflect the reliability and stability, respectively.
As shown in Fig. 15, the overall recognition rates of the KNN and SVM algorithms for wheel-rail vibration signals are only about 60% and 77%, respectively. Although the overall recognition rates of the AlexNet, ResNet-18 and DarkNet-53 algorithms exceed 80%, this is insufficient to meet the requirements of real-time monitoring in real-life applications. The proposed method has a higher recognition rate of 97.1% than the other five models and algorithms. Furthermore, it obtains a lower standard deviation, demonstrating more stable recognition behavior. In addition, the overall recognition rate of 96.7% on the testing set confirms the better performance of the proposed method.

Conclusions and future work
In this work, a novel computer vision-aided method for the nondestructive evaluation of the health condition of rail track structures based on vibration signals has been proposed. Specifically, the method does not require tedious noise reduction processing or redundant feature elimination. By directly converting the raw vibration signals into grayscale images, multi-dimensional image feature vectors are adopted instead of the one-dimensional feature arrays of traditional signal processing methods. To quickly extract numerous features from the vibration signals, the FAST-Unoriented-SIFT algorithm has been utilized. Meanwhile, the VBoW model with the optimal number of keywords has been proposed to describe and identify grayscale image features. Finally, the overall recognition rate of the proposed model on the testing set is 96.7% (i.e., 58 out of 60). In addition, by comparison with traditional learning-based classification algorithms and representative deep learning models, it is found that the proposed method is more suitable for the effective identification of strongly time-varying and random vibration signals and has promising prospects for practical structural health monitoring applications.
Although the proposed method achieves satisfactory results in this work for damage identification under different rail track structural health conditions, popular algorithms based on the Visual Bag-of-Words model are restricted to relatively short signal segments because of their learning capacity. However, high-dimensional signals and features can better capture the valuable information of a dynamic system in a more complex environment. On the other hand, in rail track structural health monitoring, obtaining more spatial information requires numerous sensors to be fixed along the track line for real-time monitoring, which not only results in data overload but also increases operation and maintenance costs. In contrast, placing mobile sensor networks on vehicle components (e.g., the axlebox) for health monitoring of rail track structures has greater potential for low-cost monitoring applications. However, the rail track vibration signal data collected by mobile sensor networks is spatio-temporal and is subject to vehicle interference noise. Recent studies [42,43] proposed methods in which sparse vibration data from mobile sensor networks were used to successfully identify features of bridge structures. Therefore, future work can focus on a more robust method for extracting the high-dimensional feature information of the wheel-rail dynamic system, building on the VBoW model and sparse representation using vehicle mobile sensor networks.