1 Introduction

As a fundamental parameter for vehicle design and safety assessment, tire load is an essential boundary condition in vehicle development. Measuring tire loads is necessary to further optimize the design, mechanical strength, and operational performance of vehicles. In addition, monitoring the status of the vehicle system plays a crucial role in vehicle safety. Therefore, as a direct result of vehicle–road interaction, tire load is the main focus of data monitoring.

Tire load monitoring methods fall into direct measurement and indirect identification. In direct measurement, a six-component force sensor [3] is commonly used for real-time, dynamic, simultaneous, and precise measurement of the six-component force on the tires; however, the sensor must be custom-made for the wheel center and is very expensive, which hinders its wide application. Therefore, several indirect methods for measuring tire loads have been proposed. Law et al. [5] identified the vertical moving loads of a train passing over a bridge at constant speed using a time-domain algorithm based on the bending moment and vibration acceleration of the bridge. After establishing a moving load identification model in the frequency domain using optimal state estimation, they identified the power spectrum of the moving loads and obtained their time-history curves via Fourier transform, thus building a moving load identification system that showed reasonable accuracy and stability in practical engineering applications [6]. Yu et al. [13] used a time–frequency domain method to identify moving loads from responses, compared the identification results of the pseudo-inverse and singular-value decomposition methods, and discussed the influence of load spacing, moving speed, measurement noise, and other parameters on the identification results. Xia et al. [12] proposed an inverse model for the wheel–rail contact forces between the wheels of a railway wagon and the track: the contact forces were inverted from the dynamic response data of the wagon body, although their high-frequency components could not be predicted well owing to the isolation of the suspension. Baffet et al. [1] proposed an adaptive tire force model that accounts for variations in road friction.
By combining the vehicle model with the tire force model, they constructed an adaptive observer to estimate the lateral tire force, vehicle side-slip angle, and road friction. Dakhlallah et al. [2] estimated the vehicle dynamic state and tire–road forces with an extended Kalman filter and used a friction model to evaluate the friction coefficient and side-slip angle from the estimated longitudinal and transverse forces. All the aforementioned methods are based on physical models that simplify complex nonlinear subsystems in modeling, resulting in lower identification accuracy, poorer identification results, and more demanding identification conditions.

With the development of artificial intelligence in recent years, data-driven load identification methods have attracted great attention. Li et al. [7] proposed a method for continuous measurement of wheel–rail contact forces based on a radial basis function (RBF) network, providing a new way to study load identification; nonetheless, the model failed to achieve satisfactory results owing to the limited performance of the radial basis neural network. Zhang et al. [14] applied a BP neural network to wheel–rail load identification and optimized the model with genetic algorithms, but the obtained results still showed significant error. Jintun et al. [9] established a convolutional neural network (CNN) model to invert the wheel–rail forces; comparative analysis of the inverted transverse axle forces revealed that the model could effectively fit trend variations and data indicators under both straight and curved track conditions. At present, most studies use simple neural network models, which leave much room for improvement in identification accuracy. Moreover, most are conducted on a single given dataset and thus lack applicability to other mass data, so their predictive performance under poor noise conditions remains unknown, limiting their practical value.

Additionally, the construction of sample sets is the basis for neural network training, and different construction methods affect both the structural design and the effectiveness of neural network models. Many scholars have studied ways to optimize sample construction. Yu and Chan [13] summarized the construction of training sample sets in the support vector machine classification algorithm and proposed a virtual sample generation method based on the Gaussian distribution to address the unbalanced distribution of training samples. This method made full use of a priori knowledge and generated enough virtual samples with reasonable labels, thereby effectively improving classification accuracy. By setting a reasonable Euclidean distance similarity threshold, Shen et al. [7] constructed a sample set with structural anisotropy and diversity for training, reducing the training time to approximately half of that of conventional methods.

In this study, a load identification method based on a one-dimensional convolutional neural network (1D CNN) and a bidirectional gated recurrent unit (BiGRU) was proposed. The CNN was used to effectively extract multidimensional spatial features of signals. The GRU, a variant of the recurrent neural network (RNN), performs well in processing time-series data, while the BiGRU can capture features in both directions, enabling better extraction of implicit information from time-series data. Combining the CNN and BiGRU greatly enhances the spatial and temporal feature extraction capabilities for multidimensional time-series data. In the proposed method, a priori information on the radial tire loads was considered, and a feature set constructed from the vibration response, position, operating condition, and other vehicle information was taken as input. The CNN extracted the associations among the features in high-dimensional space to form high-dimensional prediction feature vectors, which were then input into the BiGRU network for training, thereby outputting the final load prediction results.

Considering the tram of Nanhai as an example, this study used the proposed method to establish a prediction model for the vertical forces of running wheels. The results showed that, compared with other existing methods, the proposed method effectively reduced the error of load identification, improved identification accuracy, and showed strong robustness, indicating that relatively satisfying load identification results could be obtained for input data of different noise levels and degrees of abnormality.

2 Load Identification Methods Based on Deep Learning

2.1 1D CNN

CNN is a class of feedforward neural networks with a deep structure involving convolutional computation. It consists of an input layer, convolution layers, pooling layers, fully connected layers, and an output layer. A 1D CNN is usually used to process one-dimensional time-series signals; its basic architecture is shown in Fig. 1. The network has a total of l layers. For the input time series \({\mathbf{x}} = [x_{1} ,x_{2} ,x_{3} , \ldots ,x_{n} ]^{{\text{T}}}\), the features are first extracted through the convolution layer, where different 1D convolutional kernels extract different feature signals by convolving local regions. The output \({\mathbf{x}}_{i}\) of the ith layer can be obtained from the output \({\mathbf{x}}_{i - 1}\) of the previous layer as follows:

$${\mathbf{x}}_{i} = f_{i} \left( {{\mathbf{w}}_{i} \otimes {\mathbf{x}}_{i - 1} + {\mathbf{b}}_{i} } \right),$$
(1)

where wi is the weight matrix of the ith layer, bi is the bias matrix of the ith layer, and fi is the activation function of the ith layer. The convolution layers and the fully connected layers in this study used the LeakyReLU activation function, which has the following form:

$$f(x) = \left\{ {\begin{array}{*{20}l} {ax,} \hfill & {x < 0} \hfill \\ {x,} \hfill & {x \ge 0} \hfill \\ \end{array} } \right..$$
(2)
Fig. 1

Basic architecture of 1D CNN

The LeakyReLU function inherits the advantages of the ReLU function while addressing the dying-neuron problem of the ReLU. Because the LeakyReLU function has a small positive slope in the negative region, backpropagation can occur even for negative input values.
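As an illustration of Eq. (2), the following minimal NumPy sketch implements LeakyReLU. The slope a = 0.01 is a commonly used default and an assumption here, since the paper does not state the value used.

```python
import numpy as np

def leaky_relu(x, a=0.01):
    """LeakyReLU per Eq. (2): a*x for x < 0, x for x >= 0.

    The slope a = 0.01 is an assumed default, not a value from the paper."""
    return np.where(x < 0.0, a * x, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
y = leaky_relu(x)
# Negative inputs keep a small nonzero slope, so gradients can still
# flow backward even for negative activations.
```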

The feature signals extracted from the convolution layer were fed to the pooling layer. A pooling operation was used to select the most important features of the sequences at the convolution layer and reduce the dimension of features, thereby simplifying the computational complexity of the network. In this study, maximum pooling was conducted in the pooling operation. Multiple convolution layers and pooling layers were designed as required, and the output of the last pooling layer was taken as input to the fully connected layer. The final output of the fully connected layer was expressed as:

$${\mathbf{y}} = f_{l} \left( {{\mathbf{w}}_{l} \otimes {\mathbf{x}}_{l - 1} + {\mathbf{b}}_{l} } \right),$$
(3)

where wl is the weight matrix of the lth layer, bl is the bias matrix of the lth layer, and fl is the activation function of the lth layer.

The error between the output layer results and the actual values was propagated backward; the error gradient was then calculated, and the weights and thresholds were updated until the error satisfied the stopping criterion, at which point training ended. The mean square error (MSE) loss function was as follows:

$$L = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left( {{\mathbf{y}}_{i \, }^{{{\text{true}}}} - {\mathbf{y}}_{i \, }^{{{\text{pred}}}} } \right)^{2} } ,$$
(4)

where \({\mathbf{y}}_{i \, }^{{{\text{true}}}}\) is the true data label, and \({\mathbf{y}}_{i \, }^{{{\text{pred}}}}\) is the final output of the network.

2.2 BiGRU

GRU is a variant of the recurrent neural network (RNN). An RNN is a time-dependent network designed to process sequential information, with a memory that stores historical information. As a special type of RNN, the long short-term memory (LSTM) network uses a gating mechanism to remember historical information, selectively forget unimportant information, and retain long-term dependencies. As a variant of LSTM, GRU uses fewer parameters, converges more easily, trains more efficiently, and has a reduced risk of overfitting; it can effectively mitigate the vanishing gradient problem of RNNs, making it an ideal building block for larger networks.

GRU uses a gating mechanism with a reset gate and an update gate to control gradient propagation and mitigate the vanishing gradient problem. The reset gate and update gate capture the short-term and long-term dependencies in the time series, respectively. The input \({\mathbf{x}}_{t}\) of the current time step and the hidden state \({\mathbf{h}}_{t - 1}\) of the previous time step are taken as the inputs of both gates, and each gate's output is calculated by a fully connected layer with a sigmoid activation function.

The output rt of the reset gate is expressed as:

$${\mathbf{r}}_{t} = \sigma \left( {{\mathbf{w}}_{r} \otimes \left[ {{\mathbf{h}}_{t - 1} ,{\mathbf{x}}_{t} } \right] + {\mathbf{b}}_{r} } \right).$$
(5)

The output zt of the update gate is expressed as:

$${\mathbf{z}}_{t} = \sigma \left( {{\mathbf{w}}_{z} \otimes \left[ {{\mathbf{h}}_{t - 1} ,{\mathbf{x}}_{t} } \right] + {\mathbf{b}}_{z} } \right).$$
(6)

Before calculating the hidden state ht at the current time step, the GRU first calculates the candidate hidden state \({\tilde{\mathbf{h}}}_{t}\) as follows:

$${\tilde{\mathbf{h}}}_{t} = \tanh \left( {{\mathbf{w}}_{h} \otimes \left[ {{\mathbf{r}}_{t} \times {\mathbf{h}}_{t - 1} ,{\mathbf{x}}_{t} } \right] + {\mathbf{b}}_{h} } \right),$$
(7)

where \({\mathbf{w}}_{r}\), \({\mathbf{w}}_{z}\), and \({\mathbf{w}}_{h}\) are weight matrices, and \({\mathbf{b}}_{r}\), \({\mathbf{b}}_{z}\), and \({\mathbf{b}}_{h}\) are bias matrices.

\({\mathbf{h}}_{t - 1}\) contains the entire history of the time series from its beginning up to the previous time step. Equation (7) shows that the output \({\mathbf{r}}_{t}\) of the reset gate controls the transfer of the hidden state \({\mathbf{h}}_{t - 1}\) from the previous time step to the current one. Therefore, the reset gate can be used to discard meaningless history information.

The output \({\mathbf{z}}_{t}\) of the update gate is used as a weight to combine the hidden state \({\mathbf{h}}_{t - 1}\) of the previous time step and the candidate hidden state \({\tilde{\mathbf{h}}}_{t}\) of the current time step. Then, the hidden state \({\mathbf{h}}_{t}\) of the current time step can be obtained as follows:

$${\mathbf{h}}_{t} = \left( {1 - {\mathbf{z}}_{t} } \right) \times {\mathbf{h}}_{t - 1} + {\mathbf{z}}_{t} \times {\tilde{\mathbf{h}}}_{t} .$$
(8)
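Equations (5)–(8) can be condensed into a single time-step update. The NumPy sketch below is a direct transcription, with [·,·] implemented as vector concatenation; the dimensions and random weights are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, w_r, w_z, w_h, b_r, b_z, b_h):
    """One GRU step following Eqs. (5)-(8); weight shapes are
    (hidden, hidden + input) and [h, x] denotes concatenation."""
    hx = np.concatenate([h_prev, x_t])
    r_t = sigmoid(w_r @ hx + b_r)                                      # Eq. (5), reset gate
    z_t = sigmoid(w_z @ hx + b_z)                                      # Eq. (6), update gate
    h_cand = np.tanh(w_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)  # Eq. (7)
    return (1.0 - z_t) * h_prev + z_t * h_cand                         # Eq. (8)

# Illustrative dimensions: 3 input channels, 4 hidden units.
rng = np.random.default_rng(0)
H, D = 4, 3
w_r, w_z, w_h = (rng.standard_normal((H, H + D)) for _ in range(3))
b_r = b_z = b_h = np.zeros(H)
h_t = gru_step(rng.standard_normal(D), np.zeros(H), w_r, w_z, w_h, b_r, b_z, b_h)
```

Because Eq. (8) is a convex combination of the previous hidden state and a tanh-bounded candidate, the hidden state stays bounded from one step to the next.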

The BiGRU network consists of a forward GRU and a backward GRU. It extracts the features of time-series signals in both forward and reverse time order, remembering the forward and backward history from the beginning and end of the time segment up to the current time step. The hidden state \({\mathbf{h}}_{t}\) is calculated from the output \({\mathbf{h}}_{Lt}\) of the forward propagation layer and the output \({\mathbf{h}}_{Rt}\) of the backward propagation layer, as shown in Eq. (9).

$${\mathbf{h}}_{t} = m\left( {{\mathbf{w}}_{L} {\mathbf{h}}_{Lt} + {\mathbf{w}}_{R} {\mathbf{h}}_{Rt} } \right),$$
(9)

where \({\mathbf{w}}_{L}\) and \({\mathbf{w}}_{R}\) are the weight matrices of \({\mathbf{h}}_{Lt}\) and \({\mathbf{h}}_{Rt}\), respectively, and \(m\) is the connecting function. The BiGRU network makes full use of the information during the time before and after a single moment, enabling more comprehensive feature extraction and improving prediction accuracy.

2.3 Network Design Based on 1D CNN and BiGRU

The proposed method integrated and improved the 1D CNN and BiGRU models and applied the combined model to the radial load identification of rubber-tired vehicles, making full use of the spatial feature extraction capability of the CNN and the temporal feature extraction capability of the BiGRU. In addition, a batch normalization (BN) layer was added to the CNN to normalize the data distribution and enhance the generalization capability and robustness of the model, and a dropout layer was added to the fully connected layers to randomly mask some neuron outputs and prevent overfitting. The Adam optimization algorithm with an adaptive learning rate was used to train the model for the identification of radial loads on vehicles.

The network structure of the 1D CNN–BiGRU model is shown in Fig. 2. The input of a single sample was the time-series data of 7 channels and 100 time steps, and the output was the load value at the middle moment. The sample input was sliding-sampled again with a window size of 10 × 7 and a step size of 1, yielding 90 segments of size 10 × 7 in time order. These segments were convolved twice, and maximum pooling was used to extract the spatial features of the data. The one-dimensional convolutional kernels had a size of [1, 1] with a stride of 1, and the depths of the two convolutions were 7 and 128, respectively; the activation function was LeakyReLU. The maximum pooling layer had a size of [2, 1] with a stride of 1, and no padding was applied during convolution and pooling. Finally, feature data of 64 channels were obtained.

Fig. 2

Load identification network structure based on 1D CNN and BiGRU

The 90 sets of 64-channel feature data were fed into the BiGRU layer, which extracted the implicit information of the time-series data in both directions. The number of hidden cells in the BiGRU layer was initially set to 20. The final load prediction was output after passing through three fully connected layers, with the dropout rate set to 0.2 and the numbers of fully connected cells set to 64, 8, and 1, respectively.
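A PyTorch sketch of the described architecture is given below. Where the text is ambiguous (for instance, the exact kernel sizes and how the channel count changes from 128 to 64), the values here are assumptions chosen only to keep the tensor shapes consistent; this is not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CNNBiGRU(nn.Module):
    """Sketch of a 1D CNN-BiGRU load identifier; kernel sizes and some
    channel counts are assumptions, not the paper's exact settings."""
    def __init__(self, in_channels=7, hidden=20):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(in_channels, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128),        # BN layer, as described in Section 2.3
            nn.LeakyReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(128, 64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64),
            nn.LeakyReLU(),
        )
        self.bigru = nn.GRU(64, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, 64), nn.LeakyReLU(), nn.Dropout(0.2),
            nn.Linear(64, 8), nn.LeakyReLU(),
            nn.Linear(8, 1),
        )

    def forward(self, x):              # x: (batch, 100 steps, 7 channels)
        x = x.transpose(1, 2)          # -> (batch, 7, 100) for Conv1d
        x = self.cnn(x)                # -> (batch, 64, 50)
        x = x.transpose(1, 2)          # -> (batch, 50, 64) for the BiGRU
        out, _ = self.bigru(x)         # -> (batch, 50, 2*hidden)
        return self.head(out[:, -1])   # load at a single moment: (batch, 1)

model = CNNBiGRU()
y = model(torch.randn(4, 100, 7))
```

The output of each sample is a single load value, matching the "multiple time steps in, one time step out" sample construction described later in Section 3.3.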

The MSE loss function was used in the network training process, which was calculated as follows:

$${\text{MSE}} (\hat{y},y) = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left( {y_{{}}^{(i)} - \hat{y}_{ \, }^{(i)} } \right)^{2} } .$$
(10)

2.4 Load Identification Framework

The aforementioned deep learning-based load identification method was applied to rubber-tired rail vehicles. In the proposed model, a priori information of radial load data of tires was considered, and a feature set constructed based on the vehicle vibration response, vehicle position, operating status, and other information was taken as the input. The theoretical framework of the model is shown in Fig. 3. The load excitation generated by track irregularities was input into the vehicle system to stimulate the corresponding vibration response. By combining multiple sources of information, such as vehicle position and operating status, and after data preprocessing and feature selection, a subset of valid features was retained as the input of a sample, and the load value was used as the label of the sample. Then, a sample set with multiple time steps as the input and a single time step as the output was constructed for the training of the identification model of radial loads of tires based on 1D CNN and BiGRU. With 1D CNN and BiGRU combined, the proposed model could fully extract the spatial and temporal features of signals and show high noise immunity. MSE, R2 score, and other parameters were used to evaluate the identification error and accuracy of the model, and the generalization performance of the model under different operating conditions and its robustness to different noise levels and degrees of abnormality of the input data were analyzed and improved to obtain the final model.

Fig. 3

Theoretical framework of the model

The development of reasonable training samples is a prerequisite for neural networks to achieve good results. In previous studies, the construction of training samples lacked a theoretical basis, and the selection of input training samples was largely blind. In this study, the training set was constructed based on the observability of the state space. This made the input of the neural network training set identical to that of the Kalman filter (KF)-based load identification method, allowing a direct comparison between the identification results of the two approaches.

3 Development of Training Samples

3.1 Vehicle Theoretical Model

Figure 4 shows a simplified 11 degrees-of-freedom vertical model of a rubber-tired road vehicle. Table 1 summarizes the kinetic parameters of this model [4].

Fig. 4

Eleven degrees-of-freedom vertical kinetic model of rubber-tired vehicle

Table 1 Kinetic parameters of the 11 degrees-of-freedom system

The equation of motion of the vertical model is expressed as follows:

$${\mathbf{M\ddot{x}}}(t) + {\mathbf{C\dot{x}}}(t) + {\mathbf{Kx}}(t) = {\mathbf{U}}(t),$$
(11)

where

$${\mathbf{M}} = {\text{diag}}\left( {\left[ {\begin{array}{*{20}c} {m_{c} } & {I_{cx} } & {I_{cy} } & {m_{b} } & {I_{bx} } & {m_{b} } & {I_{bx} } \\ \end{array} } \right]} \right),$$
$${\mathbf{C}} = \left[ {\begin{array}{*{20}c} {4c_{s} } & 0 & 0 & { - 2c_{s} } & 0 & { - 2c_{s} } & 0 \\ 0 & {4c_{s} d_{1}^{2} } & 0 & 0 & { - 2c_{s} d_{1}^{2} } & 0 & { - 2c_{s} d_{1}^{2} } \\ 0 & 0 & {4c_{s} l_{c}^{2} } & { - 2c_{s} l_{c} } & 0 & {2c_{s} l_{c} } & 0 \\ { - 2c_{s} } & 0 & { - 2c_{s} l_{c} } & {2c_{s} } & 0 & 0 & 0 \\ 0 & { - 2c_{s} d_{1}^{2} } & 0 & 0 & {2c_{s} d_{1}^{2} } & 0 & 0 \\ { - 2c_{s} } & 0 & {2c_{s} l_{c} } & 0 & 0 & {2c_{s} } & 0 \\ 0 & { - 2c_{s} d_{1}^{2} } & 0 & 0 & 0 & 0 & {2c_{s} d_{1}^{2} } \\ \end{array} } \right],$$
$${\mathbf{K}} = \left[ {\begin{array}{*{20}c} {4k_{s} } & 0 & 0 & { - 2k_{s} } & 0 & { - 2k_{s} } & 0 \\ 0 & {4k_{s} d_{1}^{2} } & 0 & 0 & { - 2k_{s} d_{1}^{2} } & 0 & { - 2k_{s} d_{1}^{2} } \\ 0 & 0 & {4k_{s} l_{c}^{2} } & { - 2k_{s} l_{c} } & 0 & {2k_{s} l_{c} } & 0 \\ { - 2k_{s} } & 0 & { - 2k_{s} l_{c} } & {2k_{s} + 2k_{t} } & 0 & 0 & 0 \\ 0 & { - 2k_{s} d_{1}^{2} } & 0 & 0 & {2k_{s} d_{1}^{2} + 2k_{t} d_{1}^{2} } & 0 & 0 \\ { - 2k_{s} } & 0 & {2k_{s} l_{c} } & 0 & 0 & {2k_{s} + 2k_{t} } & 0 \\ 0 & { - 2k_{s} d_{1}^{2} } & 0 & 0 & 0 & 0 & {2k_{s} d_{1}^{2} + 2k_{t} d_{1}^{2} } \\ \end{array} } \right],$$
$${\mathbf{x}}(t) = \left[ {\begin{array}{*{20}c} {z_{c} } & {\alpha_{c} } & {\beta_{c} } & {z_{b1} } & {\alpha_{b1} } & {z_{b2} } & {\alpha_{b2} } \\ \end{array} } \right]^{{\text{T}}} ,$$
$${\mathbf{U}}(t) = \left[ {\begin{array}{*{20}c} 0 & 0 & 0 & {2k_{t} z_{10} } & {2k_{t} d_{1}^{2} \alpha_{10} } & {2k_{t} z_{20} } & {2k_{t} d_{1}^{2} \alpha_{20} } \\ \end{array} } \right].$$

The tire force is related to the tire stiffness and the relative displacement between the tire mass and the road, as expressed by Eq. (12).

$$\left\{ \begin{gathered} F_{fl} = k_{t} (z_{b1} + d_{1} \alpha_{b1} - (z_{10} + d_{1} \alpha_{10} )) \hfill \\ F_{fr} = k_{t} (z_{b1} - d_{1} \alpha_{b1} - (z_{10} - d_{1} \alpha_{10} )) \hfill \\ F_{rl} = k_{t} (z_{b2} + d_{1} \alpha_{b2} - (z_{20} + d_{1} \alpha_{20} )) \hfill \\ F_{rr} = k_{t} (z_{b2} - d_{1} \alpha_{b2} - (z_{20} - d_{1} \alpha_{20} )) \hfill \\ \end{gathered} \right.,$$
(12)

where \(F_{fl}\), \(F_{fr}\), \(F_{rl}\), and \(F_{rr}\) are the front left, front right, rear left, and rear right tire forces, respectively.
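Equation (12) is a linear spring relation and can be evaluated directly. The sketch below transcribes it; the numeric values in the usage example are hypothetical placeholders, not the Table 1 parameters.

```python
def tire_forces(k_t, d1, z_b1, a_b1, z_b2, a_b2, z10, a10, z20, a20):
    """Radial tire forces per Eq. (12): tire stiffness k_t times the
    relative displacement between each axle end and the road input."""
    F_fl = k_t * ((z_b1 + d1 * a_b1) - (z10 + d1 * a10))
    F_fr = k_t * ((z_b1 - d1 * a_b1) - (z10 - d1 * a10))
    F_rl = k_t * ((z_b2 + d1 * a_b2) - (z20 + d1 * a20))
    F_rr = k_t * ((z_b2 - d1 * a_b2) - (z20 - d1 * a20))
    return F_fl, F_fr, F_rl, F_rr

# Hypothetical check: a pure 10 mm front-axle bounce over a flat road
# loads both front tires equally and leaves the rear tires unloaded.
forces = tire_forces(1.2e6, 0.9, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
```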

The equation of motion of the half-car model system, which is expressed by Eq. (11), can be rewritten using the tire forces as the external input force, as follows:

$${\mathbf{M\ddot{x}}}(t) + {\mathbf{C\dot{x}}}(t) + {\mathbf{K^{\prime}x}}(t) = {\mathbf{U^{\prime}}}(t),$$
(13)

where

$${\mathbf{K^{\prime}}} = \left[ {\begin{array}{*{20}c} {4k_{s} } & 0 & 0 & { - 2k_{s} } & 0 & { - 2k_{s} } & 0 \\ 0 & {4k_{s} d_{1}^{2} } & 0 & 0 & { - 2k_{s} d_{1}^{2} } & 0 & { - 2k_{s} d_{1}^{2} } \\ 0 & 0 & {4k_{s} l_{c}^{2} } & { - 2k_{s} l_{c} } & 0 & {2k_{s} l_{c} } & 0 \\ { - 2k_{s} } & 0 & { - 2k_{s} l_{c} } & {2k_{s} + 2k_{t} } & 0 & 0 & 0 \\ 0 & { - 2k_{s} d_{1}^{2} } & 0 & 0 & {2k_{s} d_{1}^{2} } & 0 & 0 \\ { - 2k_{s} } & 0 & {2k_{s} l_{c} } & 0 & 0 & {2k_{s} } & 0 \\ 0 & { - 2k_{s} d_{1}^{2} } & 0 & 0 & 0 & 0 & {2k_{s} d_{1}^{2} } \\ \end{array} } \right],$$
$${\mathbf{U^{\prime}}} = \left[ {\begin{array}{*{20}c} 0 & 0 & 0 & { - (F_{fl} + F_{fr} )} & { - d_{1} (F_{fl} - F_{fr} )} & { - (F_{rl} + F_{rr} )} & { - d_{1} (F_{rl} - F_{rr} )} \\ \end{array} } \right].$$

In Eq. (13), the tire forces are explicitly expressed as input terms in the equation of motion. In addition to the tire forces, the vehicle responses and their corresponding first-order derivatives are also included in the state vector X, as follows:

$${\mathbf{X}} = [\begin{array}{*{20}c} {z_{c} } & {\alpha_{c} } & {\beta_{c} } & {z_{b1} } & {\alpha_{b1} } & {z_{b2} } & {\alpha_{b2} } & {\dot{z}_{c} } & {\dot{\alpha }_{c} } & {\dot{\beta }_{c} } & {\dot{z}_{b1} } & {\dot{\alpha }_{b1} } & {\dot{z}_{b2} } & {\dot{\alpha }_{b2} } & {F_{fl} } & {F_{fr} } & {F_{rl} } & {F_{rr} } \\ \end{array} ]^{{\text{T}}} .$$

Based on the equation of motion expressed by Eq. (13), the continuous system state equation can be expressed as follows:

$${\dot{\mathbf{X}}} = {\mathbf{AX}}.$$
(14)

A is the continuous system transition matrix:

$${\mathbf{A = }}\left[ {\begin{array}{*{20}c} {{\mathbf{0}}_{7 \times 7} } & {{\mathbf{I}}_{7 \times 7} } & {{\mathbf{0}}_{7 \times 4} } \\ {{\mathbf{ - M}}^{{{\mathbf{ - 1}}}} {\mathbf{K^{\prime}}}} & {{\mathbf{ - M}}^{{{\mathbf{ - 1}}}} {\mathbf{C}}} & {{\mathbf{Z}}_{7 \times 4} } \\ {{\mathbf{0}}_{4 \times 7} } & {{\mathbf{0}}_{4 \times 7} } & {{\mathbf{0}}_{4 \times 4} } \\ \end{array} } \right],$$

where 0 is a zero matrix, I is an identity matrix, and the following relationship holds for Z:

$${\mathbf{Z}} = {\mathbf{ - M}}^{{{\mathbf{ - 1}}}} \frac{{\partial ({\mathbf{U^{\prime}}})}}{{\partial (F_{fl} ,F_{fr} ,F_{rl} ,F_{rr} )}} = \left[ {\begin{array}{*{20}c} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ {1/m_{b} } & {1/m_{b} } & 0 & 0 \\ {d_{1} /I_{bx} } & { - d_{1} /I_{bx} } & 0 & 0 \\ 0 & 0 & {1/m_{b} } & {1/m_{b} } \\ 0 & 0 & {d_{1} /I_{bx} } & { - d_{1} /I_{bx} } \\ \end{array} } \right].$$
(15)

Using the Euler discretization method, the continuous state equation (14) is discretized, with the discrete transition matrix given by:

$${\mathbf{A}}_{d} {\mathbf{ = }}e^{{{\mathbf{A}}dt}} \approx {\mathbf{I}} + {\mathbf{A}}dt.$$
(16)
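The first-order approximation in Eq. (16) is straightforward to verify numerically. The scalar test system below is an illustrative assumption, not the vehicle model itself.

```python
import numpy as np

def discretize(A, dt):
    """Euler discretization per Eq. (16): A_d = e^{A dt} ~ I + A dt,
    valid when dt is small relative to the system's time constants."""
    return np.eye(A.shape[0]) + A * dt

# Scalar decay x' = -x: the exact one-step factor over dt = 0.01 is e^{-0.01},
# and the Euler approximation I + A dt gives 0.99.
A = np.array([[-1.0]])
Ad = discretize(A, 0.01)
exact = np.exp(-0.01)
```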

The accelerations of each degree of freedom of the vehicle body and axle bridge are measured to estimate the tire forces. Angular accelerations are difficult to obtain directly, but they can be derived from multi-point acceleration measurements through a transformation matrix; therefore, to simplify the system, the angular accelerations are used directly as measurement quantities here. The observations of the system in Eq. (13) were set as:

$${\mathbf{y}} = [\begin{array}{*{20}c} {\ddot{z}_{c} } & {\ddot{\alpha }_{c} } & {\ddot{\beta }_{c} } & {\ddot{z}_{b1} } & {\ddot{\alpha }_{b1} } & {\ddot{z}_{b2} } & {\ddot{\alpha }_{b2} } \\ \end{array} ]^{T} .$$
(17)

The observation matrix C becomes as follows:

$$C = \left( {\begin{array}{*{20}c} { - \frac{{4k_{s} }}{{m_{c} }}} & 0 & 0 & {\frac{{2k_{s} }}{{m_{c} }}} & 0 & {\frac{{2k_{s} }}{{m_{c} }}} & 0 & { - \frac{{4c_{s} }}{{m_{c} }}} & 0 & 0 & {\frac{{2c_{s} }}{{m_{c} }}} & 0 & {\frac{{2c_{s} }}{{m_{c} }}} & 0 & 0 & 0 & 0 & 0 \\ 0 & { - \frac{{4k_{s} d_{1}^{2} }}{{I_{cx} }}} & 0 & 0 & {\frac{{2k_{s} d_{1}^{2} }}{{I_{cx} }}} & 0 & {\frac{{2k_{s} d_{1}^{2} }}{{I_{cx} }}} & 0 & { - \frac{{4c_{s} d_{2}^{2} }}{{I_{cx} }}} & 0 & 0 & {\frac{{2c_{s} d_{2}^{2} }}{{I_{cx} }}} & 0 & {\frac{{2c_{s} d_{2}^{2} }}{{I_{cx} }}} & 0 & 0 & 0 & 0 \\ 0 & 0 & { - \frac{{4k_{s} l_{c}^{2} }}{{I_{cy} }}} & {\frac{{2k_{s} l_{c} }}{{I_{cy} }}} & 0 & { - \frac{{2k_{s} l_{c} }}{{I_{cy} }}} & 0 & 0 & 0 & { - \frac{{4c_{s} l_{c}^{2} }}{{I_{cy} }}} & {\frac{{2c_{s} l_{c} }}{{I_{cy} }}} & 0 & { - \frac{{2c_{s} l_{c} }}{{I_{cy} }}} & 0 & 0 & 0 & 0 & 0 \\ {\frac{{2k_{s} }}{{m_{b} }}} & 0 & {\frac{{2k_{s} l_{c} }}{{m_{b} }}} & { - \frac{{2k_{s} }}{{m_{b} }}} & 0 & 0 & 0 & {\frac{{2c_{s} }}{{m_{b} }}} & 0 & {\frac{{2c_{s} l_{c} }}{{m_{b} }}} & { - \frac{{2c_{s} }}{{m_{b} }}} & 0 & 0 & 0 & {\frac{1}{{m_{b} }}} & {\frac{1}{{m_{b} }}} & 0 & 0 \\ 0 & {\frac{{2k_{s} d_{1}^{2} }}{{I_{bx} }}} & 0 & 0 & { - \frac{{2k_{s} d_{1}^{2} }}{{I_{bx} }}} & 0 & 0 & 0 & {\frac{{2c_{s} d_{2}^{2} }}{{I_{bx} }}} & 0 & 0 & { - \frac{{2c_{s} d_{2}^{2} }}{{I_{bx} }}} & 0 & 0 & { - \frac{{d_{b} }}{{I_{bx} }}} & {\frac{{d_{b} }}{{I_{bx} }}} & 0 & 0 \\ {\frac{{2k_{s} }}{{m_{b} }}} & 0 & { - \frac{{2k_{s} l_{c} }}{{m_{b} }}} & 0 & 0 & { - \frac{{2k_{s} }}{{m_{b} }}} & 0 & {\frac{{2c_{s} }}{{m_{b} }}} & 0 & { - \frac{{2c_{s} l_{c} }}{{m_{b} }}} & 0 & 0 & { - \frac{{2c_{s} }}{{m_{b} }}} & 0 & 0 & 0 & {\frac{1}{{m_{b} }}} & {\frac{1}{{m_{b} }}} \\ 0 & {\frac{{2k_{s} d_{1}^{2} }}{{I_{bx} }}} & 0 & 0 & 0 & 0 & { - \frac{{2k_{s} d_{1}^{2} }}{{I_{bx} }}} & 0 & {\frac{{2c_{s} d_{2}^{2} }}{{I_{bx} }}} & 0 & 0 & 0 & 0 & { - \frac{{2c_{s} d_{2}^{2} }}{{I_{bx} }}} & 0 & 0 & { - \frac{{d_{b} }}{{I_{bx} }}} & {\frac{{d_{b} }}{{I_{bx} }}} \\ \end{array} } \right).$$
(18)

3.2 Observability

For load identification based on the discrete state-space model, the observability of the system states needs to be assessed to determine whether the selected output \({\mathbf{y}}\) can predict state quantities such as \(F_{fl}\), \(F_{fr}\), \(F_{rl}\), and \(F_{rr}\).

The observability represents the ability of the output y(t) to reflect the state vector x(t); it has no direct relation to the control input \({\mathbf{u}}(t)\). Therefore, when analyzing observability, it is feasible to disregard the control input and consider only the homogeneous state equation and the output equation. A system is fully observable if, for any given (known) input u and a finite observation time \(t_{f} > t_{0}\), the initial state vector x(t0) of the system can be uniquely determined from the output y(t) over the period \([t_{0} ,t_{f} ]\). For the discrete-time system obtained from Eq. (14), with \({\mathbf{J}}\) denoting the discrete transition matrix, the outputs at successive time steps can be written as:

$$\begin{array}{*{20}c} {{\mathbf{y}}(t_{0} ) = {\mathbf{Cx}}(t_{0} )} \\ {{\mathbf{y}}(t_{1} ) = {\mathbf{CJx}}(t_{0} )} \\ \vdots \\ {{\mathbf{y}}(t_{n - 1} ) = {\mathbf{CJ}}^{n - 1} {\mathbf{x}}(t_{0} )} \\ \end{array} .$$
(19)

It could be transformed into the matrix form as follows:

$${\mathbf{Y}} = {\mathbf{N}} \cdot {\mathbf{x}}(t_{0} ),$$
(20)

where \({\mathbf{N}} = \left[ {\begin{array}{*{20}l} {\mathbf{C}} \hfill & {{\mathbf{CJ}}} \hfill & \cdots \hfill & {{\mathbf{CJ}}^{n - 1} } \hfill \\ \end{array} } \right]^{{\text{T}}}\), which is called the system observability discriminant matrix.

The initial state \({\mathbf{x}}(t_{0} )\) of the system could be expressed as:

$${\mathbf{x}}(t_{0} ) = {\mathbf{N}}^{ - 1} {\mathbf{Y}}.$$
(21)

The sufficient condition for \({\mathbf{x}}(t_{0} )\) to have a unique solution is that the rank of the system observability discriminant matrix is n:

$${\text{rank}} ({\mathbf{N}}) = n,$$
(22)

where n is the dimension of the system state vector. In this case, the state of the discrete-time system is fully observable. If the system observability discriminant condition is not satisfied [i.e., \({\text{rank}} (N) = n_{1} < n\)], the states of the system are partially observable, and the number of observable states of the system is \(n_{1}\).
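The rank test of Eqs. (20)–(22) can be reproduced with a few lines of NumPy. The toy two-state system below is an assumption for illustration; the paper's actual J and C are the 18-state matrices derived above.

```python
import numpy as np

def observability_matrix(C, J):
    """Stack [C; CJ; ...; CJ^{n-1}] as in Eq. (20); the system is fully
    observable iff rank(N) equals the state dimension n (Eq. (22))."""
    n = J.shape[0]
    blocks, block = [], C.copy()
    for _ in range(n):
        blocks.append(block)
        block = block @ J
    return np.vstack(blocks)

# Toy system: two coupled states observed through a single output.
J = np.array([[1.0, 0.1],
              [0.0, 1.0]])
C_full = np.array([[1.0, 0.0]])   # sees x1, which is driven by x2 -> rank 2
C_part = np.array([[0.0, 1.0]])   # sees x2 only; x1 never appears -> rank 1
rank_full = np.linalg.matrix_rank(observability_matrix(C_full, J))
rank_part = np.linalg.matrix_rank(observability_matrix(C_part, J))
```

The second output choice leaves the system only partially observable, mirroring the paper's rank(N) = 11 < 18 result for the vehicle model.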

By calculating the rank of the system observability discriminant matrix defined in Eq. (20), \({\text{rank}} ({\mathbf{N}}) = 11 < 18\) was obtained. Therefore, the system was not fully observable; only 11 states were observable. The observability of each state variable was determined by examining the rank of the matrix after removing the corresponding column: if removing the column reduced the rank, the state was observable. The system was then decomposed by selecting 11 linearly independent row vectors to construct a nonsingular transformation matrix, where the state vector of the observable 11-dimensional subsystem was:

$${\tilde{\mathbf{x}}}_{{\mathbf{1}}} {\mathbf{ = }}\left[ {\begin{array}{*{20}c} {z_{c} } & {\alpha_{c} } & {\beta_{c} } & {\dot{z}_{b1} } & {\dot{\alpha }_{b1} } & {\dot{z}_{b2} } & {\dot{\alpha }_{b2} } & {F_{fl} } & {F_{fr} } & {F_{rl} } & {F_{rr} } \\ \end{array} } \right]^{T} .$$

The state vector of the unobservable seven-dimensional subsystem was \({\tilde{\mathbf{x}}}_{2} {\mathbf{ = }}\left[ {\begin{array}{*{20}c} {z_{b1} } & {\alpha_{b1} } & {z_{b2} } & {\alpha_{b2} } & {\dot{z}_{c} } & {\dot{\alpha }_{c} } & {\dot{\beta }_{c} } \\ \end{array} } \right]^{T}\).

Therefore, with the currently selected measured parameters, all four tire forces are observable; this observation serves as the basis for selecting the deep learning training set.

3.3 Construction of Sample Sets

The feature set was constructed from the multisource time-series data through feature selection, and the normalized seven-channel data were stitched, intercepted, and slide-sampled. The load response at each moment was used as the label of a sample, and the vibration response, vehicle position, and operating status information over a total of l time steps before and after that moment (the previously selected state variables) were used as the sample features, constructing an input–output sample of the form \(l \times m\), as shown in Fig. 5. In contrast to samples whose input and output refer to a single moment, this construction reflects the fact that the object of study is a time-series signal rather than an ordinary regression problem. Sampling by time segment allows the input of each sample to retain feature information in the time dimension, so that the output is predicted from time segments instead of individual moments. This effectively mitigates possible phase differences between actual measurement points and corruption or anomalies of the data at a single moment, thereby enhancing reasonableness and robustness.
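The sample construction described above can be sketched as a sliding-window operation. The window length, channel count, and random data below are illustrative assumptions; each label is taken at the middle time step of its window, matching the "load value at the middle moment" convention of Section 2.3.

```python
import numpy as np

def build_samples(features, loads, l=100):
    """Slide a window of l time steps over a (T, m) feature array;
    each sample's label is the load at the middle moment of its window."""
    half = l // 2
    X, y = [], []
    for t in range(half, len(features) - (l - half) + 1):
        X.append(features[t - half : t - half + l])
        y.append(loads[t])
    return np.stack(X), np.array(y)

# Hypothetical data: 1000 time steps of 7-channel responses.
rng = np.random.default_rng(1)
feats = rng.standard_normal((1000, 7))
loads = rng.standard_normal(1000)
X, y = build_samples(feats, loads)   # X: (901, 100, 7), y: (901,)
```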

Fig. 5

Construction of sample sets
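The slide-sampling step above can be sketched in a few lines. The function name and the tiny two-channel demo are illustrative, not the paper's implementation; here the l-step window is centred on the labelled moment:

```python
def make_samples(channels, labels, l):
    """Slide a window of l time steps over m synchronized channels.

    channels: list of m equal-length time series (the selected features).
    labels:   load response at each moment.
    Returns (X, y): X[i] is an l x m feature window centred on a moment,
    y[i] is the load label at that moment.
    """
    T = len(channels[0])
    half = l // 2
    X, y = [], []
    for t in range(half, T - half):
        # l consecutive rows, one per time step, each with m channel values
        window = [[ch[t - half + k] for ch in channels] for k in range(l)]
        X.append(window)
        y.append(labels[t])
    return X, y

# Tiny demo: m = 2 channels, 6 time steps, window length l = 3
chans = [[0, 1, 2, 3, 4, 5], [10, 11, 12, 13, 14, 15]]
loads = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
X, y = make_samples(chans, loads, 3)
print(len(X), len(X[0]), len(X[0][0]))  # 4 samples of shape 3 x 2
```

Each returned window is one \(l \times m\) sample whose label is the load at the window's central moment, matching the construction in Fig. 5.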

3.4 Processing of Input Data

As the input data from multiple sources have different dimensions, the original data must be converted into dimensionless index values to improve training. For this purpose, the data are normalized or standardized. Because of the large quantity of data and the complexity of the vehicle system's operating conditions, the data may contain many abnormal shock values, and such extreme values can severely distort range-based normalization. Hence, Z-score standardization was employed in this study:

$$x^{\prime } = \frac{x - \mu }{\sigma },$$
(23)

where \(\mu\) and \(\sigma\) are the mean and standard deviation of the corresponding data channel.
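Eq. (23) corresponds to the following per-channel transform (a minimal sketch using the population standard deviation; the channel values are illustrative):

```python
def zscore(xs):
    # Z-score standardization, Eq. (23): zero mean, unit variance per channel.
    mu = sum(xs) / len(xs)
    sigma = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mu) / sigma for x in xs]

channel = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, std 2
print(zscore(channel)[0])  # (2 - 5) / 2 = -1.5
```

Unlike min-max normalization, the result is not bounded to [0, 1], so a single extreme shock value shifts the statistics only mildly instead of compressing all other values.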

4 Load Identification Performance

Based on a simulation data set of the APM 300 vehicle, the performance of the proposed 1D CNN-BiGRU model is compared with that of the Kalman filter model and of other neural network models. The APM 300 uses high-pressure pneumatic tires for running and solid rubber tires for guidance; the structure of the running section is shown in Fig. 6. The radial loads investigated in this study were mainly the vertical forces between the running wheel and the running surface and between the guide wheel and the guide rail, as shown in Fig. 6.

Fig. 6

Structure and principle of APM 300 running section

The original signal-to-noise ratio of the data is 50 dB, and the operating condition is a straight line at 40 km/h, with track irregularities following the US Class V spectrum. Taking the left front wheel of the first axle as an example, the above-mentioned observability determination method was used to select seven easily measured state parameters as the inputs of the prediction model, and the running-wheel vertical force and the guide-wheel radial force were used as the model outputs, as shown in Table 2.

Table 2 Settings of features and labels

4.1 Accuracy Analysis

The accuracy of each model was evaluated using the coefficient of determination R2 score to compare the performance of the 1D CNN–BiGRU model proposed in this study with that of the Kalman filtering model and other neural network models:

$$R^{2} = 1 - \frac{{\frac{1}{n}\left( {\sum\nolimits_{i = 1}^{n} {\left( {y^{(i)} - \hat{y}^{(i)} } \right)^{2} } } \right)}}{{\frac{1}{n}\left( {\sum\nolimits_{i = 1}^{n} {\left( {y^{(i)} - \overline{y}} \right)^{2} } } \right)}} = 1 - \frac{{{\text{MSE}} (\hat{y},y)}}{{{\text{Var}} (y)}},$$
(24)

where \(\hat{y}^{(i)}\) is the predicted value of the model and \(y^{(i)}\) is the actual value. The larger the \(R^{2}\) value, the better the results of the regression model.
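Eq. (24) can be evaluated directly from predictions and measurements; a minimal sketch with illustrative values:

```python
def r2_score(y_true, y_pred):
    # Coefficient of determination, Eq. (24): 1 - MSE(y_hat, y) / Var(y).
    n = len(y_true)
    mean = sum(y_true) / n
    mse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n
    var = sum((yt - mean) ** 2 for yt in y_true) / n
    return 1.0 - mse / var

actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(r2_score(actual, predicted), 4))  # 0.98
```

A perfect prediction gives \(R^{2} = 1\); a model that always predicts the mean gives \(R^{2} = 0\), so values close to 1 indicate a good regression fit.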

The 1D CNN-BiGRU model shown in Fig. 2 is used to identify the vertical forces Fw1–Fw4 of the four running wheels. During training, the changes of training error and training accuracy for each running wheel are shown in Fig. 7a, and the changes of test error and test accuracy are shown in Fig. 7b. One training cycle refers to one complete pass of forward computation and backpropagation over all the data fed to the network. As seen in Fig. 7, the MSE of the vertical-force identification error of the proposed model is below 0.2 for the running wheel at every position, and the R2 score is above 0.8, indicating high identification accuracy.

Fig. 7

Error and accuracy curves for different running wheels

Figure 8 shows a comparison of the predicted and actual vertical force values for each running wheel test set. As observed, the predicted values obtained using the proposed method were the closest to the actual values with regard to the running wheel vertical force, with the smallest error.

Fig. 8

Comparison of prediction results of different running wheels

4.2 Robustness Analysis

In actual operation of the vehicle system, the signals collected by the sensors inevitably contain noise; under poor road conditions, the noise component can be very large, so a load identification model that predicts well in an ideal environment may perform poorly in practical applications. The load identification model must therefore withstand different levels of noise interference, that is, it must be robust, in order to have real engineering value and practical significance. In this section, noise at different signal-to-noise ratios is added to the input data to examine the model's ability to cope with different degrees of interference.
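The noise injection used in this test can be sketched as follows: for a target SNR in decibels, zero-mean Gaussian noise is scaled so that \(10\log_{10}(P_{\text{signal}}/P_{\text{noise}})\) equals the target. The function name, seed, and test signal are illustrative assumptions, not the paper's code:

```python
import math
import random

def add_noise(signal, snr_db, seed=0):
    # Scale Gaussian noise so 10*log10(P_signal / P_noise) == snr_db.
    rng = random.Random(seed)
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = p_signal / (10 ** (snr_db / 10))
    sigma = math.sqrt(p_noise)
    return [x + rng.gauss(0.0, sigma) for x in signal]

# Illustrative clean signal: a slow sinusoid standing in for a load channel.
clean = [math.sin(0.1 * t) for t in range(1000)]
noisy = add_noise(clean, snr_db=10)
```

Lower `snr_db` values (e.g. 1 dB or 0.1 dB, as tested below) make the noise power approach or exceed the signal power, which is what stresses the identification models.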

The load identification performance of each model under different signal-to-noise ratios is shown in Fig. 9. When the signals were noise-free, all models showed excellent prediction performance; as the noise component of the signals increased, the identification performance of all models decreased. The models based on ANN and LSTM neural networks degraded the most, the model based on Kalman filtering degraded the second most, and the proposed 1D CNN-BiGRU model degraded the least.

Fig. 9

Prediction performance of each model

MSE and R2 metrics for the training and testing sets of each model in the last training cycle are shown in Table 3. As shown in the table, the testing error of each model did not differ much from the training error. The errors of the ANN- and LSTM-based models were large due to the large amount of noise in the original sample data. However, the use of the CNN and of samples with multiple time steps as input allowed effective extraction of the feature information over time segments, avoiding the negative impact of anomalous data at a single moment and enabling noise reduction. Taking the signal with a signal-to-noise ratio (SNR) of 10 dB as an example, at this noise level the final MSE of the testing set in the proposed model reached 0.0203, which was 37.7–76.4% lower than that of the other models, and the R2 value reached 0.9793, which was 1.3–7.1% higher than that of the other models. These results demonstrate that the proposed model achieves low identification error and high accuracy on realistically noisy signals, which is of high practical application value.

Table 3 Load identification error and accuracy of each model under different signals
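The noise-suppressing effect of convolving over multiple time steps can be illustrated with the simplest fixed kernel; a trained Conv1D layer learns many such kernels from data, so this is an illustration of the principle rather than the paper's network:

```python
def conv1d(signal, kernel):
    # Valid-mode 1-D convolution over the time axis (cross-correlation
    # convention, as used by deep-learning Conv1D layers).
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A 5-step averaging kernel spreads and attenuates a single-moment spike,
# so an anomalous sample at one moment cannot dominate the features.
spiky = [0.0, 0.0, 5.0, 0.0, 0.0, 0.0]
print(conv1d(spiky, [0.2] * 5))  # [1.0, 1.0]
```

A difference kernel such as `[1.0, -1.0]` instead emphasizes local changes; a CNN combines many learned kernels of both kinds to extract features from each time segment.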

When the original input data were noisy signals of different SNRs, the MSE of each model is shown in Fig. 10a and the R2 value of each model in Fig. 10b. In the low-noise region, all MSE values were below 0.16 and all R2 values were above 0.98, and the differences among the models were very small. The Kalman filter also showed high identification accuracy; however, when the SNR was more than 10 dB, its identification accuracy was lower than that of the four neural network models.

Fig. 10

Performance comparison

As the noise level increased, the identification accuracy of the ANN and LSTM models dropped rapidly to a very low level. The accuracy of the CNN and of the proposed 1D CNN-BiGRU model also decreased, but only slightly, so it remained relatively high, while the Kalman filtering model ranked in the middle. When the SNR reached 0.1 dB, the R2 values of the five models were 0.7535, 0.5682, 0.5753, 0.8901, and 0.919, respectively. It can be concluded that the proposed model has excellent robustness and is well suited for load identification in harsh noise environments.

In the ideal noise-free case, each model showed excellent identification performance, and the predicted values obtained by all methods were close to the actual values. When noise at an SNR of 10 dB or 1 dB was added, the identification performance of each model decreased as the noise component increased: the models based on ANN and LSTM neural networks degraded the most, followed by the model based on Kalman filtering, while the proposed 1D CNN-BiGRU model degraded the least.

4.3 Generalization Performance Analysis

The model showed excellent load identification performance in the straight-line operating condition. Different operating speeds and operating curves were set to further investigate the generalization performance of the model. With \(F_{w1}\) as an example, the load identification performance of the model under different operating conditions is shown in Fig. 11.

Fig. 11

Load identification performance under different operating conditions

MSE and R2 of the testing set under different operating conditions are shown in Table 4. As observed, all the R2 values obtained in the proposed model were more than 0.75 for curve operating conditions with different radii, and all MSE values were less than 0.23. This indicated that the proposed model had high identification accuracy and generalization ability and was suitable for different operating conditions.

Table 4 Identification error and accuracy under different operating conditions

5 Applications

The sensor layout of the vehicle system is shown in Fig. 12. An inclination-measuring gyroscope is arranged on the central floor of the vehicle body to collect the roll and pitch angular velocities of the vehicle body, and an accelerometer at the same location collects the vertical acceleration of the vehicle body. A gyroscope is arranged at the center of each of the first and second axles to collect the roll angular velocity of the axle, and an accelerometer at the center of each axle collects its vertical acceleration.

Fig. 12

Measuring point arrangement of vehicle system

The data collected by the sensors contain noise and anomalies and therefore need to be preprocessed. Abnormal data caused by road cracks and other factors are extracted and replaced with values satisfying the fitted normal distribution, yielding cleaned, valid data. The five-point cubic smoothing method smooths discrete data by least-squares fitting; its main purpose is to reduce the high-frequency random noise mixed into the vibration signal. In this paper, the five-point cubic smoothing method is used to remove noise and trend items.
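For interior points, the five-point cubic (Savitzky-Golay) smoothing weights are \((-3, 12, 17, 12, -3)/35\): a cubic polynomial passes through unchanged while high-frequency noise is attenuated. The sketch below keeps the two boundary points on each side unchanged, whereas the full method uses separate boundary formulas:

```python
def five_point_cubic_smooth(y):
    # Least-squares cubic fit over each 5-point window (Savitzky-Golay).
    # Interior weights (-3, 12, 17, 12, -3)/35 reproduce any cubic exactly
    # and attenuate high-frequency random noise.
    w = (-3, 12, 17, 12, -3)
    out = list(y)  # boundary points kept as-is in this simplified sketch
    for i in range(2, len(y) - 2):
        out[i] = sum(wj * y[i - 2 + j] for j, wj in enumerate(w)) / 35
    return out

cubic = [t ** 3 for t in range(8)]  # exactly a cubic signal
print(five_point_cubic_smooth(cubic))  # interior points are reproduced exactly
```

Because a cubic trend survives the filter while zero-mean noise is averaged down, repeated application mainly removes the high-frequency random component, which is the behavior relied on above.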

The load identification model established in this paper is applied to the Shanghai Pujiang APM 300 vehicle to verify its identification performance. Using the collected data, the sample set for the running-wheel radial-force load identification model was established, finally yielding 475,000 samples. The 1D CNN-BiGRU model was built on the Keras framework and trained; meanwhile, ANN, LSTM, CNN, and other models were trained for comparison.

The errors of the ANN and LSTM models are larger, exceeding 0.4, while that of the CNN exceeds 0.3. The final testing-set MSE of the proposed model reached 0.2236, a decrease of 41% compared with the other models. These results show that the proposed 1D CNN-BiGRU model can handle actual noisy signals with a smaller identification error and higher accuracy than the other models, and thus has high engineering application value.

Figure 13 shows the comparison between the actual radial force of the running wheels and the predicted values of each model on a test set. The test set consists of data from 10 s of continuous vehicle running under this operating condition, with a sampling frequency of 100 Hz, i.e., 1000 samples in total. As seen in Fig. 13, the prediction of the proposed model is closest to the actual radial force of the running wheel, with the smallest error.

Fig. 13

Prediction results of different models

6 Conclusion

In this study, a load identification method based on the 1D CNN–BiGRU neural network was proposed. A feature set containing multiple sources of information, such as vehicle vibration response, vehicle position, and running status, was constructed to make full use of the data mining advantages of CNN to extract the potential connections of discontinuous data in the high-dimensional space. The BiGRU network was used to capture features in both directions to better extract implicit information from the time-series data. The integration of the CNN and BiGRU greatly enhanced the spatial and temporal feature extraction capability for multidimensional time-series data.

The construction method of the training set was studied. The inputs of the training set were selected based on observability and could be directly compared with the KF identification results. The samples were constructed with multiple time steps as input and a single time step as output and were sampled by time segment, so that each sample retained sufficient feature information along the time dimension. Predicting the output from features over a time segment rather than an individual moment effectively counteracted the actual phase difference between measurement points and possible corruption or anomalies of the data at an individual moment, resulting in stronger model robustness.

This study introduced several metrics to evaluate the application value of the model, including its prediction accuracy, its generalization performance under different operating conditions, and its robustness to input data of different SNRs. The results showed that the proposed method had higher prediction accuracy, better generalization performance, and stronger robustness than KF and other network models. It indicated that the proposed method could be applied to load identification under various operating conditions and noisy environments.