1 Introduction

The prediction of the evolution of state variables in dynamical systems is a vital component of many scientific applications, such as biology, geophysics, earthquake engineering, solid mechanics, robotics, and computer vision [1,2,3,4,5,6,7]. Black-box techniques based on data-driven mapping, as well as parameterized multi-physics models describing the progression of the data, have previously been utilized for making predictions on the states. This task remains an active area of research due to challenges on many fronts, such as the quality and scarcity of relevant physical data, the dynamics and complexity of the system, and the reliability and accuracy of the prediction model.

On the other hand, the characterization of parameters in the multi-physics models of these dynamical systems is also critical [8,9,10,11,12,13,14]. The task is challenging due in part to potential noise pollution in the system’s measured data captured by sensors, as well as the potentially high-dimensional parameter space, which leads to ill-posed problems that are difficult to solve numerically. Standard optimization techniques such as genetic algorithms [15, 16], simulated annealing [17], and non-linear least squares [18, 19] have been employed for parameter identification, but they can be computationally expensive and may not converge for the ill-posed, non-convex optimization problems encountered when solving inverse problems on musculoskeletal (MSK) systems [15, 20].

In recent years, machine learning (ML) and deep-learning-based approaches have gained significant popularity for solving forward and inverse problems, owing to their capability of effectively extracting complex features and patterns from data [21]. This has been successfully demonstrated in numerous engineering applications such as reduced-order modeling [22,23,24,25,26] and materials modeling [27,28,29], among others. Data-driven computing techniques that enforce constraints of conservation laws in the learning algorithms of a material database have been developed in the field of computational mechanics [29,30,31,32,33,34,35,36,37]. More recently, physics-informed neural networks (PINNs) [11, 38, 39] have been developed to approximate the solutions of given physical equations using neural networks (NNs). By minimizing the residuals of the governing partial differential equations (PDEs) and the associated initial and boundary conditions, PINNs have been successfully applied to forward problems [11, 40, 41] and inverse problems [11, 38, 42,43,44], where the unknown system characteristics are treated as trainable parameters or functions [38, 45]. For biomechanics and biomedical applications [1, 46,47,48,49,50], this method has been applied extensively along with other ML techniques [51, 52]. These approaches attempt to bridge the gap between ML-based data-driven surrogate models and the satisfaction of physical laws.

In this study, we focus on the application to musculoskeletal systems, aiming to utilize non-invasive muscle activity measurements such as surface electromyography (sEMG) signals to predict joint kinetics or kinematics [1, 18, 19], which is of great significance for health assessment and rehabilitation [15, 16]. These sEMG signals can be used as control inputs to drive the physiological subsystems that are governed by parameterized non-linear differential equations, thus forming the forward dynamics problem. Given information on muscle activations, the joint motion of a subject-specific MSK system can be obtained by solving this forward dynamics problem. Data-driven approaches for motion prediction have also been introduced to directly map the input sEMG signal to joint kinetics/kinematics, bypassing the forward dynamics equations and the need for parameter estimation [26,27,28,29,30]. However, the resulting ML-based surrogate models lack interpretability and may not satisfy the underlying physics. Another challenge is that the sEMG signal usually exhibits a wide range of frequencies that are non-trivial for ML models to map to the joint motion [1].

In our previous work [1], a physics-informed parameter identification neural network (PI-PINN) was proposed for the simultaneous prediction of motion and parameter identification, with application to MSK systems. Using the raw transient sEMG signals obtained from the sensors and the corresponding joint motion data, the PI-PINN learned a forward model to predict the motion while identifying the parameters of the Hill-type muscle models representing the contractile muscle–tendon complex. A feature-encoded approach was introduced to enhance the training of the PI-PINN, which yielded high motion prediction accuracy and identified system parameters within a physiological range using only a limited number of training samples. However, this method relies on a mapping in a feature domain constituted by Fourier and polynomial bases, which requires the input sEMG signal to span the entire duration of the motion. This prevents real-time prediction as the signal is obtained from the sensor.

To enhance the predictive accuracy for time-dependent signals, recurrent neural networks (RNNs) such as gated recurrent units (GRUs) [29, 53] are utilized in this study to inform predictions with the history of the motion. To overcome the limited size of the data and to exploit the composite frequency bands in the signals, a multi-resolution (MR) approach is proposed, in which wavelets are used to decompose both the raw sEMG and joint motion signals into coarse-scale components at various frequency scales and the remaining fine-scale details. Using principles of multi-resolution theory and transfer learning, the training process is repeated recursively from the coarse scale to the full scale in order to map the sEMG signal to the joint motion. Furthermore, Gaussian noise is introduced to the recorded motion data used for training to enhance the robustness and generalizability of the model [29]. The trained model can be applied to real-time motion prediction given the raw sEMG signal obtained from the sensor.

This manuscript is organized as follows. Section 2 introduces the subsystems and mathematical formulations of MSK forward dynamics, followed by the proposed multi-resolution PI-RNN framework for simultaneous motion prediction and system parameter identification in Sect. 3. Sections 4 and 5 verify the proposed framework using synthetic data and validate it by modeling the elbow flexion–extension movement using subject-specific sEMG signals and recorded motion data, respectively. Concluding remarks and future work are summarized in Sect. 6.

2 Formulations for muscle mechanics and musculoskeletal forward dynamics

This section provides a brief overview of muscle mechanics and forward dynamics of the human MSK system, with details in “Appendices A and B”. As depicted in Fig. 1, multiple subsystems within the MSK forward dynamics interact hierarchically: (1) the neural excitation \(u\left(t\right)\) transforms into muscle activation \(a\left(t\right)\) (activation dynamics); (2) muscle activation drives the muscle fibers to produce force \({F}^{MT}\) (muscle–tendon (MT) contraction dynamics); and (3) the resultant forces produce the joint motion \(q\) (translation and rotation) of the MSK system, termed the MSK forward dynamics [18, 19].

Fig. 1

The subsystems involved in the forward dynamics of an MSK system are depicted in this flowchart. Neural excitations are transmitted to muscle fibers (activation dynamics) that contract to produce force (muscle–tendon contraction dynamics). These forces generate torques at the joints (structural level MSK dynamics) leading to joint motion [1, 54]

2.1 Neural excitation-to-activation dynamics

While activations \(a(t)\) in the muscle fibers result from a non-linear transformation of neural excitations \(u\left(t\right)\), these excitations are difficult to measure in vivo. Therefore, they are estimated from the raw sEMG signals \(e(t)\) [15, 16], considering an electro-mechanical delay:

$$u\left(t\right)=e\left(t-d\right)$$
(1)

where \(d\) is the electro-mechanical delay between the origination of the neural excitation and its arrival at the muscle group. The muscle activation signal \(a(t)\) is then expressed as,

$$a(t)=\frac{\mathrm{exp}\left(Au\left(t\right)\right)-1}{\mathrm{exp}\left(A\right)-1}$$
(2)

where \(A\) is a shape factor. These activations initiate muscle fiber contraction leading to force production from the muscle group (Fig. 2).
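For illustration, Eqs. (1)-(2) translate directly into code. The following minimal Python sketch uses placeholder values for the delay \(d\) and the shape factor \(A\); these are assumptions for illustration, not the subject-specific values used later.

```python
import numpy as np

def neural_excitation(e, t, d):
    """Eq. (1): neural excitation as the delayed sEMG signal, u(t) = e(t - d)."""
    return e(t - d) if t >= d else 0.0

def activation(u, A):
    """Eq. (2): non-linear excitation-to-activation transform with shape factor A."""
    return (np.exp(A * u) - 1.0) / (np.exp(A) - 1.0)

# Placeholder values (a 40 ms delay and A = 2 are illustrative assumptions):
e = lambda t: 0.5 * (1.0 + np.sin(2.0 * np.pi * t))  # synthetic sEMG envelope in [0, 1]
a = activation(neural_excitation(e, t=0.5, d=0.04), A=2.0)
```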

Fig. 2

A muscle–tendon complex in the arm modeled by a homogenized Hill-type model: each muscle group in a is treated as a homogenized muscle–tendon (MT) complex described by the model shown in b

2.2 Muscle–tendon force generation through contraction dynamics

Forces in the muscle–tendon (MT) complex are generated by the dynamics of MT contraction; the structural length-scale behaviour of the MT complex is captured by homogenized Hill-type muscle models (described in “Appendix B”). Each muscle group can be characterized by a parameter vector,

$${\boldsymbol{\kappa}}=\left[{l}_{0}^{M}, {v}_{max}^{M}, {f}_{0}^{M}, {l}_{s}^{T}, {\vartheta }_{0}\right]$$
(3)

containing constants such as the maximum isometric force in the muscle (\({f}_{0}^{M}\)), the optimal muscle length (\({l}_{0}^{M}\)) corresponding to the maximum isometric force, the maximum contraction velocity (\({v}_{max}^{M}\)), the slack length of the tendon (\({l}_{s}^{T}\)), and the initial pennation angle (\({\vartheta }_{0}\)) [18, 19]. The total force produced by the MT complex, \({F}^{MT}\), can be expressed as:

$${F}^{MT}(a,{\widetilde{l}}^{M},{\widetilde{v}}^{M},\vartheta ;{\boldsymbol{\kappa}})={F}^{M}\left(a,{\widetilde{l}}^{M},{\widetilde{v}}^{M};{\boldsymbol{\kappa}}\right)\mathrm{cos}\vartheta $$
(4)

where \(a\) is the activation function in Eq. (2), \({\widetilde{l}}^{M}\) is the normalized muscle length, \({\widetilde{v}}^{M}\) is the normalized velocity of the muscle, and \(\vartheta \) is the current pennation angle. In this study, the tendon is assumed to be rigid \(\left({l}^{T}={l}_{s}^{T}\right)\), which simplifies the MT contraction dynamics [57, 58] that account for the interaction of the activation, force–length, and force–velocity properties of the MT complex. More details can be found in “Appendices A and B”.

2.3 MSK forward dynamics of motion

Body movement results from the forces produced by the actuators (MT complexes), which are converted to torques at the joints of the body, leading to the rotation and translation of the joints; these are considered the generalized degrees of freedom \(({\varvec{q}})\) of an MSK system. The dynamic equilibrium can be expressed as

$${\varvec{I}}\left({\varvec{q}}\right)\boldsymbol{ }\ddot{{\varvec{q}}}-{{\varvec{T}}}^{MT}({\varvec{a}},{\varvec{q}},\dot{{\varvec{q}}};{\varvec{\kappa}})-{\varvec{E}}({\varvec{q}})={\mathbf 0}$$
(5)

where \({\varvec{q}}, \dot{{\varvec{q}}}, \ddot{{\varvec{q}}}\) are the vectors of generalized angular motions, angular velocities, and angular accelerations, respectively; \({\varvec{E}}({\varvec{q}})\) is the torque from the external forces acting on the MSK system, e.g., ground reactions and gravitational loads; \({\varvec{I}}\left({\varvec{q}}\right)\) is the inertial matrix; and \({{\varvec{T}}}^{MT}\) is the torque from all muscles in the model, calculated as \({{\varvec{T}}}^{MT}({\varvec{a}},{\varvec{q}},\dot{{\varvec{q}}};{\varvec{\kappa}})={\varvec{R}}\left({\varvec{q}}\right){{\varvec{F}}}^{MT}({\varvec{a}},{\varvec{q}},\dot{{\varvec{q}}};{\varvec{\kappa}})\), where \({\varvec{R}}\left({\varvec{q}}\right)\) contains the moment arms and \({{\varvec{F}}}^{MT}({\varvec{a}},{\varvec{q}},\dot{{\varvec{q}}};{\varvec{\kappa}})\) the forces of the MT complexes. Given the muscle activation signals \({\varvec{a}}\), the initial conditions, and the parameters \({\varvec{\kappa}}\) of the involved muscle groups, the generalized angular motions \({\varvec{q}}\) and angular velocities \(\dot{{\varvec{q}}}\) of the joints can be obtained by solving Eq. (5). An example of these vectors is shown in Sect. 4 and “Appendix D”.
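As an illustration of solving Eq. (5), the sketch below integrates a single-degree-of-freedom version with SciPy's explicit Runge-Kutta solver (the type of scheme used for the synthetic data in Sect. 4). The inertia, torque, and activation functions are hypothetical stand-ins for the model-specific terms.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical stand-ins for the model-specific terms of Eq. (5):
m_fa, l_fa = 1.0, 0.3                                 # point mass [kg] and moment arm [m]
I = lambda q: m_fa * l_fa**2                          # inertia I(q)
E = lambda q: -m_fa * 9.81 * l_fa * np.cos(q)         # gravitational torque E(q)
T_MT = lambda a, q, qd: 10.0 * a - 0.5 * qd           # toy muscle torque T^MT(a, q, qdot; kappa)
a_of_t = lambda t: 0.5 * (1.0 + np.sin(2.0 * np.pi * t))  # activation history a(t)

def rhs(t, y):
    """First-order form of Eq. (5), I(q) qddot = T^MT + E, with state y = [q, qdot]."""
    q, qd = y
    return [qd, (T_MT(a_of_t(t), q, qd) + E(q)) / I(q)]

sol = solve_ivp(rhs, t_span=(0.0, 10.0), y0=[np.pi / 6, 0.0],
                method="RK45", max_step=0.02)         # joint angle q(t) in sol.y[0]
```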

3 Multi-resolution recurrent neural networks for physics-informed parameter identification

This section describes the recurrent neural network algorithms, followed by the physics-informed parameter identification that enables the development of a forward dynamics surrogate with simultaneous parameter identification. The multi-resolution analysis based on the fast wavelet transform [59, 60], used for training data augmentation, is then defined, and the overall computational framework of the multi-resolution physics-informed recurrent neural network is discussed.

3.1 Recurrent neural networks and gated recurrent units

The computational graph of a standard recurrent neural network (RNN) and its unfolded form are shown in Fig. 3. The hidden state \({\varvec{h}}\) allows RNNs to learn important history-dependent features from the data at sequential time steps [29, 53]. The unfolded graph shows the sharing of parameters across the architecture of the network, allowing for efficient training. The forward propagation of an RNN starts with an initial hidden state that embeds history-dependent features and propagates through all input steps. Considering an RNN with \(m\) history steps as shown in Fig. 3, the propagation of the hidden state can be expressed as follows [29].

Fig. 3

Computational graph of a standard recurrent neural network using ‘m’ history steps for prediction

$$\begin{aligned}&{{\varvec{h}}}_{i}={a}_{tanh}\left({{\varvec{W}}}_{hh}{{\varvec{h}}}_{i-1}+{{\varvec{W}}}_{xh}{{\varvec{x}}}_{i}+{{\varvec{b}}}_{h}\right),\\& i=n-m,\dots ,n \end{aligned}$$
(6)

The hidden state at the final (current) step \(n\) is then used to inform the prediction.

$${\widehat{{\varvec{q}}}}_{n}={{\varvec{W}}}_{hq}{{\varvec{h}}}_{n}+{{\varvec{b}}}_{q}$$
(7)

Here, \({a}_{tanh}\) is the hyperbolic tangent function; \({{\varvec{W}}}_{xh}, {{\varvec{W}}}_{hh},\) and \({{\varvec{W}}}_{hq}\) are the trainable weight coefficients; \({{\varvec{b}}}_{h}\) and \({{\varvec{b}}}_{q}\) are the trainable bias coefficients. The trainable parameters are shared across all RNN steps. Let \({{\varvec{x}}}_{n}=\left[{t}_{n},{e}_{n}^{1},\dots ,{e}_{n}^{{N}_{a}}\right]\) be the current time and current sEMG data of the \({N}_{a}\) muscle components, and \({\widehat{{\varvec{q}}}}_{n}\) the predicted joint motions at the current time \({t}_{n}\). Figure 4a illustrates the computational graph of an RNN model trained to predict the motion at step \(n\) by using \(m\) history steps of \({\varvec{x}}\) and \({\varvec{q}}\) as well as \({\varvec{x}}\) at step \(n\). The forward propagation is defined as

$$\begin{aligned}&{{\varvec{h}}}_{i}={a}_{tanh}\left({{\varvec{W}}}_{hh}{{\varvec{h}}}_{i-1}+{{\varvec{W}}}_{xh}{{\varvec{x}}}_{i}+{{\varvec{W}}}_{qh}{{\varvec{q}}}_{i}+{{\varvec{b}}}_{h}\right),\\&i=n-m,\dots ,n-1\end{aligned}$$
(8)
$${{\varvec{h}}}_{n}={a}_{tanh}\left({{\varvec{W}}}_{hh}{{\varvec{h}}}_{n-1}+{{\varvec{W}}}_{xh}{{\varvec{x}}}_{n}+{{\varvec{b}}}_{h}\right)$$
(9)
$${\widehat{{\varvec{q}}}}_{n}={{\varvec{W}}}_{h\widehat{q}}{{\varvec{h}}}_{n}+{{\varvec{b}}}_{q}$$
(10)

with trainable parameters including the weight coefficients \({{\varvec{W}}}_{hh},{{\varvec{W}}}_{xh},{{\varvec{W}}}_{qh}\) and \({{\varvec{W}}}_{h\widehat{q}}\) and the bias coefficients \({{\varvec{b}}}_{h}\) and \({{\varvec{b}}}_{q}\). During training, the ‘teacher forcing’ method is used, where the measured motion data are given to the model in the history steps. In test mode, the model is fed its own previous predictions as input to inform future predictions. The inputs received in this scenario can be quite different from those seen during training, leading the network to make extrapolative predictions and accumulate errors that pollute the predictions. To improve the testing performance and enhance model accuracy and robustness, a user-controlled amount of random Gaussian noise is added to the recorded motion data to introduce stochasticity, so that the network learns variable input conditions resembling those in test mode; see [29] for details.
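A minimal NumPy sketch of the forward pass in Eqs. (8)-(10) is given below, including the Gaussian noise injected into the history motions during training; the layer dimensions, weight initialization, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_h, n_x, n_q = 8, 3, 1              # hidden, input, and motion dimensions (assumed)
W_hh, W_xh, W_qh = (0.1 * rng.standard_normal(s)
                    for s in [(n_h, n_h), (n_h, n_x), (n_h, n_q)])
W_hq = 0.1 * rng.standard_normal((n_q, n_h))
b_h, b_q = np.zeros(n_h), np.zeros(n_q)

def rnn_predict(x_hist, q_hist, x_n, sigma=0.0):
    """Eqs. (8)-(10): propagate m history steps with teacher forcing, then
    predict the motion at the current step n; sigma > 0 adds the Gaussian
    noise used during training to robustify test-mode feedback."""
    h = np.zeros(n_h)
    for x_i, q_i in zip(x_hist, q_hist):               # i = n-m, ..., n-1
        q_i = q_i + sigma * rng.standard_normal(n_q)   # noisy teacher forcing
        h = np.tanh(W_hh @ h + W_xh @ x_i + W_qh @ q_i + b_h)  # Eq. (8)
    h = np.tanh(W_hh @ h + W_xh @ x_n + b_h)           # Eq. (9): no motion at step n
    return W_hq @ h + b_q                              # Eq. (10)

# Example with m = 2 history steps (x = [t, e^1, e^2], q is the joint angle):
x_hist = [np.array([0.00, 0.2, 0.1]), np.array([0.01, 0.3, 0.2])]
q_hist = [np.array([0.52]), np.array([0.55])]
q_hat = rnn_predict(x_hist, q_hist, x_n=np.array([0.02, 0.4, 0.2]), sigma=0.01)
```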

Fig. 4

An example computational graph of an RNN that uses one history step: a The train mode and b the test mode, where the motion predicted from the previous step is used as part of the input to predict motion at the current step

Standard RNNs, however, have difficulty learning long-term dependencies due to the vanishing and exploding gradient issues arising from the recurrent connections. To mitigate these issues, gated recurrent units (GRUs) have been developed [29, 53]. A standard GRU consists of a reset gate \({{\varvec{r}}}_{n}\) that removes irrelevant history information, an update gate \({{\varvec{u}}}_{n}\) that controls the amount of history information passed to the next step, and a candidate hidden state \({\widetilde{{\varvec{h}}}}_{n}\) that is used to calculate the current hidden state \({{\varvec{h}}}_{n}\). Considering a GRU with \(m\) history steps, the forward propagation can be expressed as follows [29]:

$$\begin{array}{l}{{\varvec{r}}}_{i}={a}_{\sigma }\left({{\varvec{W}}}_{hr}{{\varvec{h}}}_{i-1}+{{\varvec{W}}}_{xr}{{\varvec{x}}}_{i}+{{\varvec{W}}}_{qr}{{\varvec{q}}}_{i}+{{\varvec{b}}}_{r}\right)\\ {{\varvec{u}}}_{i}={a}_{\sigma }\left({{\varvec{W}}}_{hu}{{\varvec{h}}}_{i-1}+{{\varvec{W}}}_{xu}{{\varvec{x}}}_{i}+{{\varvec{W}}}_{qu}{{\varvec{q}}}_{i}+{{\varvec{b}}}_{u}\right)\\ {{\varvec{z}}}_{\left(i,i-1\right)}={{\varvec{r}}}_{i}\odot {{\varvec{W}}}_{h\widetilde{h}}{{\varvec{h}}}_{i-1}\\ {\widetilde{{\varvec{h}}}}_{i}={a}_{tanh}\left({{\varvec{z}}}_{\left(i,i-1\right)}+{{\varvec{W}}}_{x\widetilde{h}}{{\varvec{x}}}_{i}+{{\varvec{W}}}_{q\widetilde{h}}{{\varvec{q}}}_{i}+{{\varvec{b}}}_{\widetilde{h}}\right)\\ {{\varvec{c}}}_{\left(i,i-1\right)}={{\varvec{u}}}_{i}{\odot {\varvec{h}}}_{i-1}\\ {\widetilde{{\varvec{c}}}}_{\left(i,i\right)}={{\varvec{u}}}_{i}{\odot \widetilde{{\varvec{h}}}}_{i}\\ {{\varvec{h}}}_{i}={{\varvec{c}}}_{\left(i,i-1\right)}+{\widetilde{{\varvec{h}}}}_{i}-{\widetilde{{\varvec{c}}}}_{\left(i,i\right)}+{{\varvec{b}}}_{h}\\ \forall i=n-m,\dots ,n-1,\end{array}$$
(11)
$$\begin{array}{c}{{\varvec{r}}}_{n}={a}_{\sigma }\left({{\varvec{W}}}_{hr}{{\varvec{h}}}_{n-1}+{{\varvec{W}}}_{xr}{{\varvec{x}}}_{n}+{{\varvec{b}}}_{r}\right)\\ {{\varvec{u}}}_{n}={a}_{\sigma }\left({{\varvec{W}}}_{hu}{{\varvec{h}}}_{n-1}+{{\varvec{W}}}_{xu}{{\varvec{x}}}_{n}+{{\varvec{b}}}_{u}\right)\\ {{\varvec{z}}}_{\left(n,n-1\right)}={{\varvec{r}}}_{n}\odot {{\varvec{W}}}_{h\widetilde{h}}{{\varvec{h}}}_{n-1}\\ {\widetilde{{\varvec{h}}}}_{n}={a}_{tanh}\left({{\varvec{z}}}_{\left(n,n-1\right)}+{{\varvec{W}}}_{x\widetilde{h}}{{\varvec{x}}}_{n}+{{\varvec{b}}}_{\widetilde{h}}\right)\\ {{\varvec{c}}}_{\left(n,n-1\right)}={{\varvec{u}}}_{n}{\odot {\varvec{h}}}_{n-1}\\ {\widetilde{{\varvec{c}}}}_{\left(n,n\right)}={{\varvec{u}}}_{n}{\odot \widetilde{{\varvec{h}}}}_{n}\\ {{\varvec{h}}}_{n}={{\varvec{c}}}_{\left(n,n-1\right)}+{\widetilde{{\varvec{h}}}}_{n}-{\widetilde{{\varvec{c}}}}_{\left(n,n\right)}+{{\varvec{b}}}_{h}\end{array}$$
(12)
$${\widehat{{\varvec{q}}}}_{n}={{\varvec{W}}}_{h\widehat{q}}{{\varvec{h}}}_{n}+{{\varvec{b}}}_{\widehat{q}}$$
(13)

where \(\odot \) denotes the element-wise (Hadamard) product; \({a}_{\sigma }(\cdot )\) is the sigmoid activation function and \({a}_{tanh}\left(\cdot \right)\) is the hyperbolic tangent function; \({{\varvec{W}}}_{hr},{{\varvec{W}}}_{xr},{{\varvec{W}}}_{qr},{{\varvec{W}}}_{hu},{{\varvec{W}}}_{xu},{{\varvec{W}}}_{qu},{{\varvec{W}}}_{h\widetilde{h}},{{\varvec{W}}}_{x\widetilde{h}},{{\varvec{W}}}_{q\widetilde{h}}\) and \({{\varvec{W}}}_{h\widehat{q}}\) are the trainable weight coefficients; \({{\varvec{b}}}_{r},{{\varvec{b}}}_{u},{{\varvec{b}}}_{\widetilde{h}},{{\varvec{b}}}_{h}\) and \({{\varvec{b}}}_{\widehat{q}}\) are the trainable bias coefficients. The current hidden state \({{\varvec{h}}}_{n}\) is calculated by a linear interpolation between the previous hidden state \({{\varvec{h}}}_{n-1}\) and the candidate hidden state \({\widetilde{{\varvec{h}}}}_{n}\), based on the update gate \({{\varvec{u}}}_{n}\). The model is trained via the backpropagation-through-time algorithm for RNNs [21]. Training plugs the measured motion data into the history steps (shown in Fig. 5), known as the teacher forcing procedure [21]; for predictions, the prediction from the previous step is used to predict the current step. The addition of Gaussian noise to the measured data, as described above, is adopted in GRU models as well.
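For concreteness, one history-step update of Eq. (11) can be sketched as follows; the dictionary-based storage of the weights is an illustrative choice, not a prescribed implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_history_step(h_prev, x_i, q_i, W, b):
    """One history-step update of Eq. (11); W and b are dicts holding the
    trainable weights and biases named as in the text (e.g. W['hr'] for W_hr)."""
    r = sigmoid(W["hr"] @ h_prev + W["xr"] @ x_i + W["qr"] @ q_i + b["r"])  # reset gate
    u = sigmoid(W["hu"] @ h_prev + W["xu"] @ x_i + W["qu"] @ q_i + b["u"])  # update gate
    h_tilde = np.tanh(r * (W["hh"] @ h_prev)
                      + W["xh"] @ x_i + W["qh"] @ q_i + b["ht"])            # candidate state
    # Last lines of Eq. (11): interpolate between h_prev and h_tilde via u
    return u * h_prev + h_tilde - u * h_tilde + b["h"]
```

At the current step \(n\), the same update is applied without the \({\varvec{q}}\) terms (Eq. (12)), and the prediction follows from the affine map in Eq. (13).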

Fig. 5

An example computational graph of a GRU in train mode that uses one history step: a Starting with an initial or previously obtained hidden state \(\left({{\varvec{h}}}_{n-2}\right)\), the main GRU cell takes the input \({{\varvec{x}}}_{n-1}\) and motion \({{\varvec{q}}}_{n-1}\), which are used to obtain the GRU hidden state \({{\varvec{h}}}_{n-1}\) at step \(n-1\) (Eq. (11)), and b the hidden state \({{\varvec{h}}}_{n-1}\) is plugged back into the GRU along with the input \({{\varvec{x}}}_{n}\) at step \(n\) to predict the motion \({\widehat{{\varvec{q}}}}_{n}\) (Eqs. (12)-(13)). The ‘+’ cell produces an output (arrow pointing outwards) that is the summation of its inputs (arrows pointing into the cell)

3.2 Simultaneous forward dynamics learning and parameter identification

With the governing equations of general MSK forward dynamics (Sect. 2.3), the following parameterized ODE system is defined:

$${\boldsymbol{\mathcal{L}}}\left[{\varvec{q}}\left(t\right);{\varvec{\lambda}}\right]={\varvec{s}}\left(t;{\varvec{\omega}}\right), \forall t\in (0,\mathrm{T}], {\boldsymbol{\mathcal{B}}}[{\varvec{q}}\left(0\right)]={\varvec{g}}$$
(14)

where the differential operator \({\boldsymbol{\mathcal{L}}}\left[(\cdot );{\varvec{\lambda}}\right]\) is parameterized by a set of parameters \({\varvec{\lambda}}\), and the right-hand side \({\varvec{s}}\left(t;{\varvec{\omega}}\right)\) is parameterized by \({\varvec{\omega}}\); \({\boldsymbol{\mathcal{B}}}[(\cdot )]\) is the operator for the initial conditions, and \({\varvec{g}}\) is the vector of prescribed initial conditions. To simplify notation, the ODE parameters are denoted by \(\boldsymbol{\Gamma }=\{{\varvec{\lambda}},{\varvec{\omega}}\}\). The solution to the ODE system, \({\varvec{q}}:[0,T]\to {\mathbb{R}}\), depends on the choice of parameters \(\boldsymbol{\Gamma }\).

Here, an RNN is used to relate data inputs containing the discrete sEMG signals and discrete times from the \(m\) previous history time steps of a trial, \({\cup }_{i=n-m}^{n}{{\varvec{x}}}_{i}\in {\mathbb{R}}^{{n}_{in}}, m\in {\mathbb{Z}}^{+}\), to the discrete joint motion data output at the current time step, \({{\varvec{q}}}_{n}\in {\mathbb{R}}\), approximating the MSK forward dynamics. Let the training input at the \({i}^{th}\) history step be defined as \({{\varvec{x}}}_{i}=\left[{t}_{i},{e}_{i}^{1},\dots ,{e}_{i}^{{N}_{a}}\right]\), where \({t}_{i}\) denotes the time at the \({i}^{th}\) time step and \({\left\{{e}_{i}^{j}\right\}}_{j=1}^{{N}_{a}}\) denotes the sEMG signals of the \({N}_{a}\) muscle groups involved in the MSK joint motion at \({t}_{i}\). The motion at time step \(n\) is then predicted from the inputs of the previous \(m\) steps using the RNN.

$${\widehat{{\varvec{q}}}}_{n}\left({\varvec{\theta}}\right)={f}_{RNN}\left({{\varvec{x}}}_{n},{{\varvec{x}}}_{n-1},{{\varvec{q}}}_{n-1},\dots ,{{\varvec{x}}}_{n-m},{{\varvec{q}}}_{n-m};{\varvec{\theta}}\right)$$
(15)

where \({f}_{RNN}\) denotes the RNN evaluations (depending on the model chosen) discussed in Eqs. (11)-(13). The optimal RNN parameters \(\widetilde{{\varvec{\theta}}}\) and ODE parameters \(\widetilde{\boldsymbol{\Gamma }}\) are obtained by minimizing the composite loss function \(J\) as follows,

$$\widetilde{{\varvec{\theta}}},\widetilde{\boldsymbol{\Gamma }}=\underset{{\varvec{\theta}},\boldsymbol{\Gamma }}{\text{argmin}}\left(J\right)=\underset{{\varvec{\theta}},\boldsymbol{\Gamma }}{\text{argmin}}\left({J}_{data}+\beta {J}_{res}\right)$$
(16)

where \(\beta \) is a parameter that regularizes the contribution of the ODE residual term to the loss function and can be estimated analytically [1]. The data loss is defined by,

$${J}_{data}=\frac{1}{{N}_{data}}{\sum }_{\alpha =1}^{{N}_{data}}{\Vert {\widehat{{\varvec{q}}}}_{\alpha }\left({\varvec{\theta}}\right)-{{\varvec{q}}}_{\alpha }\Vert }_{{L}_{2}}^{2}$$
(17)

where \({\widehat{{\varvec{q}}}}_{\alpha }\left({\varvec{\theta}}\right)\) is the predicted motion and \({{\varvec{q}}}_{\alpha }\) is the recorded motion of the MSK joints. In addition to training an MSK forward dynamics surrogate, the proposed framework simultaneously identifies important MSK parameters from the training data by minimizing the residual of the governing equation of MSK system dynamics in Eq. (5):

$${J}_{res}=\frac{1}{{N}_{data}}{\sum }_{\alpha =1}^{{N}_{data}}{\Vert {\varvec{r}}\left({\widehat{{\varvec{q}}}}_{\alpha }\left({\varvec{\theta}}\right);\boldsymbol{\Gamma }\right)\Vert }_{{L}_{2}}^{2},$$

with

$${\varvec{r}}\left({\widehat{{\varvec{q}}}}_{\alpha }\left({\varvec{\theta}}\right);\boldsymbol{\Gamma }\right)=\mathcal{L}\left[{\widehat{{\varvec{q}}}}_{\alpha }\left({\varvec{\theta}}\right);{\varvec{\lambda}}\right]-{\varvec{s}}\left({t}_{\alpha };{\varvec{\omega}}\right)$$
(18)

where \({\varvec{r}}\left({\widehat{{\varvec{q}}}}_{\alpha }\left({\varvec{\theta}}\right);\boldsymbol{\Gamma }\right)\) is the residual associated with Eq. (14) for the \({\alpha }^{th}\) sample, and \(\boldsymbol{\Gamma }=\{{\varvec{\lambda}},{\varvec{\omega}}\}\) represents the ODE parameters relevant to the MSK system. The minimization of the loss function in Eq. (16) requires the gradients of the network outputs with respect to the network parameters \(\left({\varvec{\theta}}\right)\), the MSK parameters \(\left(\boldsymbol{\Gamma }\right)\), and the inputs, which can be obtained efficiently by automatic differentiation [61]. The formulation in Eq. (15) is general, so more advanced RNN frameworks, such as the GRU described in Eqs. (11)-(13), can be used.
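A schematic sketch of the loss evaluation is given below, using PyTorch as an assumed automatic differentiation backend; `model` and `residual` are hypothetical callables standing in for Eqs. (15) and (18).

```python
import torch

def pi_loss(model, x_seq, q_hist, t, q_target, Gamma, residual, beta=1e-3):
    """Composite loss of Eqs. (16)-(18): data misfit plus beta-weighted ODE
    residual. Time derivatives of the predicted motion are obtained by
    automatic differentiation with respect to the time input."""
    t = t.clone().requires_grad_(True)
    q_hat = model(x_seq, q_hist, t)                           # surrogate, Eq. (15)
    dq = torch.autograd.grad(q_hat.sum(), t, create_graph=True)[0]
    ddq = torch.autograd.grad(dq.sum(), t, create_graph=True)[0]
    J_data = torch.mean((q_hat - q_target) ** 2)              # Eq. (17)
    J_res = torch.mean(residual(q_hat, dq, ddq, Gamma) ** 2)  # Eq. (18)
    return J_data + beta * J_res                              # Eq. (16)

# Both the network weights (theta) and the MSK parameters (Gamma) are updated
# by the same optimizer, e.g.:
# optimizer = torch.optim.Adam([*model.parameters(), Gamma], lr=1e-3)
```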

3.3 Multi-resolution training with transfer learning

To improve the training efficiency of RNNs for MSK applications with mixed-frequency sEMG input signals and low-frequency output joint motion, a multi-resolution decomposition of the training input–output data is introduced in Sect. 3.3.1, followed by the transfer-learning-based multi-resolution training protocols in Sect. 3.3.2.

3.3.1 Wavelet based multi-resolution analysis

Consider a sequence of nested subspaces \(\dots \subset {V}_{-1}\subset {V}_{0}\subset {V}_{1}\subset \dots \subset {L}^{2}\left({\mathbb{R}}\right)\), where \({\bigcup }_{j\in \mathcal{Z}}{V}_{j}={L}^{2}\left({\mathbb{R}}\right)\) and \({\bigcap }_{j\in \mathcal{Z}}{V}_{j}=\{0\}\). Each subspace \({V}_{j}\) at scale \([j]\) is spanned by a set of scaling functions \({\phi }_{j,k}(t)\), i.e., \({V}_{j}=\mathrm{span}\left\{{\phi }_{j,k}\left(t\right)|{\phi }_{j,k}\left(t\right)={2}^{\frac{j}{2}}\phi \left({2}^{j}t-k\right),k\in \mathcal{Z}\right\}\).

Each subspace is related to the finer subspace through the law of dilation, i.e., if \(\phi \left(t\right)\in {V}_{j}\), then \(\phi \left(2t\right)\in {V}_{j+1}, \forall j\in \mathcal{Z}\). Translations of the scaling function span the same subspace, i.e., if \(\phi \left(t\right)\in {V}_{j}\), then \(\phi \left(t-k\right)\in {V}_{j}, \forall j,k\in \mathcal{Z}\).

The orthogonal complement of \({V}_{j}\) in \({V}_{j+1}\) is \({W}_{j}\), such that,

$${V}_{j+1}={V}_{j}\oplus {W}_{j}, \forall j\in \mathcal{Z}$$
(19)

where \(\oplus \) denotes the direct sum. This subspace \({W}_{j}\) is spanned by a set of wavelet functions \({\psi }_{j,k}(t)\), i.e., \({W}_{j}=\mathrm{span}\left\{{\psi }_{j,k}\left(t\right)|{\psi }_{j,k}\left(t\right)={2}^{\frac{j}{2}}\psi \left({2}^{j}t-k\right), k\in \mathcal{Z}\right\}\), where \(\psi \left(t\right)\) is the mother wavelet. It follows that,

$${\oplus }_{j\in \mathcal{Z}}{W}_{j}={L}^{2}(\mathbb{R})$$
(20)

and therefore,

$${V}_{j}={V}_{i}\oplus \left({\oplus }_{k=0}^{j-i-1}{W}_{i+k}\right), j>i.$$
(21)

The two-scale dilation and translation relations for the scaling functions can be written as

$$\phi \left(t\right)=\sqrt{2}{\sum }_{k=-\infty }^{\infty }{d}_{k}\phi \left(2t-k\right).$$
(22)

Orthogonal wavelet functions can be obtained by imposing orthogonality conditions between the scaling and wavelet functions in the frequency domain using the Fourier transform,

$$\psi \left(t\right)=\sqrt{2}{\sum }_{k=-\infty }^{\infty }{\left(-1\right)}^{k-1}{d}_{-k-1}\phi \left(2t-k\right)$$
(23)

where \({d}_{k}\) are the coefficients of the two-scale relation in Eq. (22).

Orthogonal scaling functions can be constructed by choosing a candidate function \({\phi }^{*}\left(t\right)\) that has reasonable decay and finite support, with \(\int {\phi }^{*}\left(t\right)dt\ne 0\). It should also satisfy the two-scale relation,

$${\phi }^{*}\left(t\right)={\sum }_{k}{p}_{k}{\phi }^{*}\left(2t-k\right), k\in \mathcal{Z}.$$
(24)

With these, an orthogonal scaling function \(\phi \left(t\right)\) can be expressed in terms of \({\phi }^{*}\left(t\right)\) as

$$\phi \left(t\right)={\sum }_{k=-\infty }^{\infty }{a}_{k}{\phi }^{*}\left(t-k\right).$$
(25)

It is then possible to express the scaling function at the fine scale in terms of the scaling function and the wavelet functions at the coarser scale,

$$\phi \left(2t-l\right)={\sum }_{k=-\infty }^{\infty }{d}_{l-2k}\phi \left(t-k\right)+{\sum }_{k=-\infty }^{\infty }{h}_{l-2k}\psi \left(t-k\right), l\in \mathcal{Z}.$$
(26)

Any function can be approximated at scale \([j]\) by using \({\phi }_{j,k}\) as a basis, or equivalently by using its coarse-scale \([j-1]\) representation together with the details at the coarse scale, i.e.,

$${P}_{j}f={\sum }_{k=-\infty }^{\infty }{S}_{k}^{[j]}{\phi }_{j,k}={P}_{j-1}f+{H}_{j-1}f={\sum }_{k=-\infty }^{\infty }{S}_{k}^{[j-1]}{\phi }_{j-1,k}+{\sum }_{k=-\infty }^{\infty }{T}_{k}^{[j-1]}{\psi }_{j-1,k}$$
(27)

where \({P}_{j}\) and \({H}_{j}\) are the operators projecting \(f\) onto the subspace \({V}_{j}\) and onto the orthogonal subspace \({W}_{j}\) containing the details of \(f\) at scale \([j]\), respectively; \({S}_{k}^{[j]}\) and \({T}_{k}^{[j]}\) are the corresponding basis coefficients at scale \([j]\). While the example shown here is one-dimensional, this multi-resolution representation can be extended to multiple dimensions.

3.3.2 Multi-resolution data representation and training protocols

In this approach, a given signal \(f\left(t\right)\) is represented using the multi-resolution scaling functions and wavelets. A scale \([j]\) representation of the signal \(f\left(t\right)\) can be obtained from the scale \([r]\) \((j>r)\) representation by adding the wavelet (high-frequency) components of the scales from \([r]\) up to \([j-1]\), using the discrete wavelet transform modified from Eq. (27),

$$\begin{aligned}{P}_{j}f\left(t\right)&={P}_{r}f\left(t\right) +{\sum }_{b=r}^{j-1}{H}_{b}f(t)={\sum }_{k=-\infty }^{\infty }{S}_{k}^{[r]}{\phi }_{r,k}\left(t\right)\\ &\quad +{\sum }_{b=r}^{j-1}{\sum }_{k=-\infty }^{\infty }{T}_{k}^{[b]}{\psi }_{b,k}\left(t\right)\end{aligned}$$
(28)

where \({P}_{r}\) is the projection operator at scale \([r]\), and \({H}_{b}\) are the wavelet projections of the signal added from scale \(\left[r\right]\) to scale \([j-1]\) to reconstruct the signal at scale \([j]\); \({S}_{k}^{[r]}\) and \({T}_{k}^{[b]}\) are the scaling and wavelet function coefficients, obtained by the orthogonality conditions given in Sect. 3.3.1.

Representing a time series at multiple resolutions with the wavelet transform offers advantages for feature extraction from signals. Compared to the Fourier transform, which offers localization only in the frequency domain, the wavelet transform provides localization in both the frequency and time domains, making it more suitable for time-history (or sequence) learning algorithms such as the standard RNN and its enhanced variant, the GRU. More specifically, training efficiency can be enhanced by a sequential training strategy for the time-history input (sEMG) and output (joint motion) data: applying the fast wavelet transform [59, 60] to obtain the input and output data from low to high resolutions results in better generalization performance of the RNN trained to map the sEMG signals to the joint motion time history, as described below. Second-order Daubechies wavelets are used in this work.
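As an illustration, the fast wavelet transform of a signal can be computed with, e.g., the PyWavelets package (an assumed implementation choice, sketched here on a stand-in signal):

```python
import numpy as np
import pywt

rng = np.random.default_rng(0)
e_raw = rng.standard_normal(512)          # stand-in for one raw sEMG channel

# Three-level fast wavelet transform with second-order Daubechies wavelets:
# one coarse approximation band plus three detail bands.
coeffs = pywt.wavedec(e_raw, wavelet="db2", level=3)
cA3, cD3, cD2, cD1 = coeffs               # scale [-3] approximation and details

# The transform is invertible: reconstruction recovers the full-scale signal.
e_rec = pywt.waverec(coeffs, wavelet="db2")
assert np.allclose(e_rec[:len(e_raw)], e_raw)
```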

Here we consider a general MSK system as described in Sect. 2. The original unfiltered data are denoted as scale [0] and are decomposed into a sequence of lower scales \(\left[-j\right], j\in {\mathbb{Z}}^{+}\), for multi-resolution training.

Let \({{\varvec{D}}}^{[0]}\) be the input training data at the full scale \(\left(j=0\right)\) of the raw signals, i.e.,

$$\begin{aligned}&{{\varvec{D}}}^{[0]}=\left[{{\varvec{x}}}_{1}^{\left[0\right]},{{\varvec{x}}}_{2}^{[0]},\dots ,{{\varvec{x}}}_{{N}_{data}}^{[0]}\right],\\&{{\varvec{x}}}_{i}^{[0]}=\left[{t}_{i},{e}_{i}^{1[0]},\dots ,{e}_{i}^{{N}_{a}[0]}\right] .\end{aligned}$$
(29)

and let the motion of the joints of the MSK system at the \({i}^{th}\) time step at the full scale \(\left(j=0\right)\) be \({{\varvec{q}}}_{i}^{\left[0\right]}\), so that the array of the unfiltered motion data for the duration of the motion is \({{\varvec{q}}}^{\left[0\right]}=\left[{{\varvec{q}}}_{1}^{\left[0\right]},{{\varvec{q}}}_{2}^{\left[0\right]},\dots, {{\varvec{q}}}_{{N}_{data}}^{\left[0\right]}\right]\).

From MR theory, subtracting details from the full-scale representation \([0]\) of the signal results in a coarse-scale representation of the signal at scale \(\left[-k\right], k=1,\dots ,j\). The projected training data at coarse scale \([-j]\) is defined as

$${{\varvec{D}}}^{[-j]}=\left[{{\varvec{x}}}_{1}^{\left[-j\right]},{{\varvec{x}}}_{2}^{\left[-j\right]},\dots ,{{\varvec{x}}}_{{N}_{data}}^{\left[-j\right]}\right]$$
(30)

where \({N}_{data}\) is the total number of data points and

$${{\varvec{x}}}_{i}^{\left[-j\right]}=\left[{t}_{i},{e}_{i}^{1\left[-j\right]},\dots ,{e}_{i}^{{N}_{a}\left[-j\right]}\right], i=1,\dots ,{N}_{data}$$
(31)

is the input data at scale \([-j]\) and time step \(i\). The motion of the MSK joints at the \({i}^{th}\) time step at scale \([-j]\) is \({{\varvec{q}}}_{i}^{\left[-j\right]}\). The data sets for a representative muscle group ‘\(MT\)’, \({e}_{i}^{MT[-j]}\), and the motion \({{\varvec{q}}}_{i}^{\left[-j\right]}\) are obtained from the original raw data \({e}_{i}^{MT[0]}\) and \({{\varvec{q}}}_{i}^{\left[0\right]}\) by wavelet projection using Eq. (27), that is,

$$\begin{aligned}{e}^{MT[-j]}\left(t\right)\equiv {P}_{j}{e}^{MT[0]}\left(t\right)&={P}_{j-1}{e}^{MT[0]}\left(t\right)+{H}_{j-1}{e}^{MT[0]}\left(t\right)\\ &={e}^{MT[0]}\left(t\right)-{\sum }_{b=0}^{j-1}{H}_{b}{e}^{MT[0]}(t),\end{aligned}$$
$${{\varvec{q}}}^{[-j]}\left(t\right)\equiv {{\varvec{P}}}_{j}{{\varvec{q}}}^{[0]}\left(t\right)={{\varvec{P}}}_{j-1}{{\varvec{q}}}^{[0]}\left(t\right)+{{\varvec{H}}}_{j-1}{{\varvec{q}}}^{[0]}\left(t\right)={{\varvec{q}}}^{[0]}\left(t\right)-{\sum }_{b=0}^{j-1}{{\varvec{H}}}_{b}{{\varvec{q}}}^{[0]}(t)$$
(32)

where \({{\varvec{P}}}_{j}\) and \({{\varvec{H}}}_{j}\) are the projection operators in multiple dimensions. Hence, the datasets containing lower-resolution representations of the original scale-\([0]\) signals are nested as:

$$\begin{aligned}&{{\varvec{D}}}^{\left[-j\right]}\subset {{\varvec{D}}}^{\left[-j+1\right]}\subset \dots {{\varvec{D}}}^{\left[-1\right]}\subset {{\varvec{D}}}^{\left[0\right]}, \\&{{\varvec{q}}}^{\left[-j\right]}\subset {{\varvec{q}}}^{\left[-j+1\right]}\subset \dots {{\varvec{q}}}^{\left[-1\right]}\subset {{\varvec{q}}}^{\left[0\right]} \end{aligned}$$
(33)

where \({{\varvec{q}}}^{\left[-j\right]}=\left[{{\varvec{q}}}_{1}^{\left[-j\right]},{{\varvec{q}}}_{2}^{\left[-j\right]},\dots ,{{\varvec{q}}}_{{N}_{data}}^{\left[-j\right]}\right]\).
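One way to realize the projection in Eq. (32) numerically is to zero the \(j\) finest detail bands before reconstruction, sketched below under the same PyWavelets assumption as above:

```python
import numpy as np
import pywt

def project_to_scale(signal, j, wavelet="db2"):
    """Scale [-j] representation per Eq. (32): remove the j finest detail
    bands (the H_b terms) by zeroing them before reconstruction."""
    coeffs = pywt.wavedec(signal, wavelet=wavelet, level=j)
    coeffs[1:] = [np.zeros_like(c) for c in coeffs[1:]]  # keep the approximation only
    return pywt.waverec(coeffs, wavelet=wavelet)[:len(signal)]

# Nested training sets of Eq. (33), e.g. for 3-scale training:
# q_m2 = project_to_scale(q_raw, 2)   # q^[-2]
# q_m1 = project_to_scale(q_raw, 1)   # q^[-1]
# q_0  = q_raw                        # q^[0]
```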

Instead of learning the signal mapping from the original raw sEMG input data \({{\varvec{D}}}^{\left[0\right]}\) to the motion data \({{\varvec{q}}}^{\left[0\right]}\), we initiate learning from a coarse-scale representation of the input–output data at scale \([-j]\), mapping \({{\varvec{D}}}^{\left[-j\right]}\) to \({{\varvec{q}}}^{\left[-j\right]}\). For the multi-resolution RNN, the initial learning starts from the coarsest scale \([-j]\) as follows:

$${{\varvec{h}}}_{i}^{\left[-j\right]}={a}_{tanh}\left({{\varvec{W}}}_{hh}^{[-j]}{{\varvec{h}}}_{i-1}^{[-j]}+{{\varvec{W}}}_{xh}^{[-j]}{{\varvec{x}}}_{i}^{[-j]}+{{\varvec{W}}}_{qh}^{[-j]}{{\varvec{q}}}_{i}^{\left[-j\right]}+{{\varvec{b}}}_{h}^{[-j]}\right), \forall i=n-m,\dots ,n-1$$
(34)
$${{\varvec{h}}}_{n}^{[-j]}={a}_{tanh}\left({{{\varvec{W}}}_{hh}^{[-j]}{\varvec{h}}}_{n-1}^{[-j]}+{{\varvec{W}}}_{xh}^{[-j]}{{\varvec{x}}}_{n}^{[-j]}+{{\varvec{b}}}_{h}^{[-j]}\right)$$
(35)
$${\widehat{{\varvec{q}}}}_{n}^{\left[-j\right]}={{\varvec{W}}}_{h\widehat{q}}^{[-j]}{{\varvec{h}}}_{n}^{[-j]}+{{\varvec{b}}}_{q}^{[-j]}$$
(36)

At the next finer scale \([-j+1]\), the weights and biases learned at scale \([-j]\) (using early stopping [62]) are used as the initial values for \({{\varvec{W}}}_{hh}^{[-j+1]}\), \({{\varvec{W}}}_{xh}^{[-j+1]}\), \({{\varvec{W}}}_{qh}^{[-j+1]}\), \({{\varvec{W}}}_{h\widehat{q}}^{[-j+1]}\), \({{\varvec{b}}}_{h}^{[-j+1]}\), and \({{\varvec{b}}}_{q}^{[-j+1]}\), similar to the concept of transfer learning [63].

Similarly, for the multi-resolution GRU, the initial learning starts from the coarsest scale \([-j]\), as described in “Appendix C”. The same procedure of transferring the NN parameters in Eqs. (34)-(36) is repeated with \([-j]\to [-j+1]\) until scale [0] is reached. To enhance model accuracy and robustness, variations based on Gaussian noise are added to the motion data in each sequential step, as suggested by [29]. The sequential MR training process is described in Algorithm 1.
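A schematic sketch of this sequential process is given below; `make_dataset` and `train` are hypothetical helpers standing in for the wavelet projection of Eq. (33) and the Adam optimization with early stopping, respectively.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(q, sigma=0.01):
    """User-controlled noise added to the recorded motions (see Sect. 3.1)."""
    return q + sigma * rng.standard_normal(q.shape)

def multiresolution_training(model, Gamma, coarsest_j, make_dataset, train):
    """Schematic of Algorithm 1: train at the coarsest scale [-j], then carry
    the learned weights over to each finer scale up to the full scale [0]."""
    for j in range(coarsest_j, -1, -1):        # scales [-j], ..., [-1], [0]
        D_j, q_j = make_dataset(scale=-j)      # projected inputs/motions, Eq. (33)
        model, Gamma = train(model, Gamma, D_j, add_gaussian_noise(q_j),
                             early_stopping=(j > 0))  # stop early except at [0]
    return model, Gamma                        # parameters identified at scale [0]
```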


4 Verification example

To verify the proposed MR PI-RNN framework, an elbow flexion–extension model [1] with synthetic sEMG signals subject to Gaussian noise and the associated motion responses was considered. The flowchart of the proposed computational framework for simultaneous forward dynamics prediction and identification of MSK parameters is shown in Fig. 6.

Fig. 6

An overview of the application of this framework to the recorded motion data. The location of motion capture markers is circled in red and the sEMG sensors on Biceps and Triceps muscle groups in blue and green, respectively. The simplified rigid body model was used in the forward dynamics equations within the framework with appropriately scaled anthropometric properties (for geometry) and physiological parameters (for muscle–tendon material models). The raw sEMG signals were mapped to the target angular motion of the elbow and used to simultaneously characterize the MSK system using the proposed Multi-Resolution PI-RNN framework

The model contained two rigid links corresponding to the upper arm and forearm, with lengths \({l}_{ua}\) and \({l}_{fa}\), respectively. They were connected at a hinge resembling the elbow joint “A”, while the upper arm link was fixed at the top joint “B”, and the biceps (Bi) and triceps (Tri) muscle–tendon complexes (modeled by Hill-type models with parameters \({{\varvec{\kappa}}}_{Bi}\) and \({{\varvec{\kappa}}}_{Tri}\)) were represented by the lines connecting the links, as shown in Fig. 6. The degree of freedom of the model was the elbow flexion angle \(q\). The mass of the forearm was assumed to be concentrated at the wrist; hence, a mass \({m}_{fa}\) was attached to one end of the forearm link with a moment arm \({l}_{fa}\) from the elbow joint. Tendons were assumed rigid [58] for ease of computation.

The equation of motion for this rigid-body system is given in “Appendix D”. Given the synthetic sEMG signals (\({e}^{Bi}(t),{e}^{Tri}(t)\)), the initial conditions \(q\left(0\right)=\frac{\pi }{6} \; \mathrm{radians}\) and \(\dot{q}\left(0\right)=0 \; \mathrm{radians}/\mathrm{sec}\), and the parameters in Table 1, the motion of the elbow joint, \(q\), can be obtained by solving the MSK forward dynamics problem using an explicit Runge–Kutta scheme implemented in Python’s SciPy library [64].

Table 1 Parameters involved in the forward dynamics setup of elbow flexion–extension motion

To verify the robustness of the MR framework to different levels of noise in the input, the following test was performed. First, five samples (Trials 1 to 5) of noiseless synthetic muscle sEMG signals were assumed, as shown in Fig. 7. In practical applications, signals obtained from measurement devices such as sEMG sensors contain noise. Therefore, three cases were developed by adding Gaussian noise \(\left(\mathcal{N}\left(\mu ,\sigma \right)\right)\) with zero mean \(\left(\mu =0\right)\) and increasing levels of standard deviation \(\left(\sigma \right)\) to the input synthetic sEMG signals, as listed in Table 2. As the maximum value of the noiseless sEMG signals is 1, the chosen \(\sigma \)'s were kept within 10-20% of the signal maximum for a reasonable level of noise; higher noise levels (> 20%) would dominate the underlying ‘noiseless’ periodic sinusoidal signal, leading to non-physiological synthetic sEMG signals. The corresponding output motions were generated by passing the noisy sEMG signals as input to the forward dynamics equations of Sect. 2. The following training procedures were performed for each of the three cases.

Fig. 7

The original ‘noiseless’ input–output data set, with the synthetic biceps and triceps sEMG signals having variations in frequency for five trials, is shown at the top. Increasing levels of noise are added to develop three cases of synthetic mixed-frequency input sEMG, from which the corresponding output motions are solved using the forward dynamics equations. To verify the MR framework, the mixed-frequency input data of these three cases are then mapped to their corresponding motion data

Table 2 Input data and Gaussian noise level for each case

4.1 1-scale training

The mixed-frequency input sEMG signals and the corresponding output motion data \({\varvec{q}}\) at scale [0], denoted by \({{\varvec{D}}}^{\left[0\right]}\) and \({{\varvec{q}}}^{\left[0\right]}\), respectively, are mapped to obtain a baseline performance. This is termed 1-scale training, as only the full scale (i.e., [0]) of the mixed-frequency data is used for training.

4.2 2-scale training

a. Initiate learning from a coarse-scale representation of the mixed-frequency input data at scale \([-1]\) and map \({{\varvec{D}}}^{\left[-1\right]}\) to the corresponding motion data at scale [-1], \({{\varvec{q}}}^{\left[-1\right]}\), of that case.

b. Transfer the parameters to the next training scale and finish the learning by mapping \({{\varvec{D}}}^{\left[0\right]}\) to \({{\varvec{q}}}^{\left[0\right]}\).

4.3 3-scale training

a. Start learning from a coarse-scale representation of the mixed-frequency input data at scale \([-2]\) and map \({{\varvec{D}}}^{\left[-2\right]}\) to the corresponding motion data at scale [-2], \({{\varvec{q}}}^{\left[-2\right]}\), of that case.

b. Transfer the parameters to the next training scale and continue learning by mapping \({{\varvec{D}}}^{\left[-1\right]}\) to \({{\varvec{q}}}^{\left[-1\right]}\).

c. Transfer the parameters to the next training scale and finish the learning by mapping \({{\varvec{D}}}^{\left[0\right]}\) to \({{\varvec{q}}}^{\left[0\right]}\).

For each case, and for each of the training scales within that case, the training data samples contained the data of trials 1, 2, 4, and 5, while trial 3 was used for testing; each trial has \(n=\) 500 data points. The MSK parameters \(\boldsymbol{\Gamma }={\left\{{\Gamma }_{l}\right\}}_{l=1}^{4}=\left\{{f}_{0,Bi}^{M},{l}_{0,Bi}^{M},{f}_{0,Tri}^{M},{l}_{0,Tri}^{M}\right\}\) were chosen to be identified from the training data using the proposed framework. Differences in the units and physiological nature of the parameters can affect the conditioning of the parameter identification system. To mitigate this issue, normalization [1, 44] was applied to each of the parameters,

$${\overline{\Gamma } }_{{\varvec{l}}}=\frac{{\Gamma }_{l}}{{\Gamma }_{l}^{\left(0\right)}}$$
(37)

where \({\Gamma }_{l}^{\left(0\right)}\) was the initial value of the parameter. Therefore, the parameters to be identified became \(\overline{\boldsymbol{\Gamma } }={\left\{{\overline{\Gamma } }_{l}\right\}}_{l=1}^{4}\).

The proposed framework, as described in Sect. 3, was applied to each case to simultaneously learn the MSK forward dynamics surrogate and identify the MSK parameters \(\overline{\boldsymbol{\Gamma } }\) by optimizing Eq. (16), where the residual of the governing equation for the current time step \(k\), was expressed as

$$r\left({\widehat{q}}_{k}^{\left[-j\right]}\left({{\varvec{\theta}}}_{{\varvec{q}}}\right),{\dot{\widehat{q}}}_{k}^{\left[-j\right]}\left({{\varvec{\theta}}}_{{\varvec{q}}}\right),{\ddot{\widehat{q}}}_{k}^{\left[-j\right]}\left({{\varvec{\theta}}}_{{\varvec{q}}}\right);\boldsymbol{\Gamma }\left(\overline{\boldsymbol{\Gamma } };{\boldsymbol{\Gamma }}^{(0)}\right)\right)=I{\ddot{\widehat{q}}}_{k}^{\left[-j\right]}\left({{\varvec{\theta}}}_{{\varvec{q}}}\right)-E\left({\widehat{q}}_{k}^{\left[-j\right]}\left({{\varvec{\theta}}}_{{\varvec{q}}}\right)\right)-{T}^{MT}\left({a}_{Bi}\left({t}_{k}\right),{a}_{Tri}\left({t}_{k}\right),{\widehat{q}}_{k}^{\left[-j\right]}\left({{\varvec{\theta}}}_{{\varvec{q}}}\right),{\dot{\widehat{q}}}_{k}^{\left[-j\right]}\left({{\varvec{\theta}}}_{{\varvec{q}}}\right);\boldsymbol{\Gamma }\left(\overline{\boldsymbol{\Gamma } };{\boldsymbol{\Gamma }}^{\left(0\right)}\right)\right)$$
(38)

and is included in the residual term \({J}_{res}\) of the loss function in Eq. (16). While the training proceeds sequentially from coarse to fine scales of the motion, the final identification of the parameters takes place at scale \([0]\), i.e., the full scale, in each of the 1-, 2-, and 3-scale MR training types.

A GRU with 2 history steps and 1 hidden layer with 50 neurons was used. The training was performed after standardizing the data, so that the scale or range of the input and output has minimal influence on model performance [21]. The Adam algorithm [65] was used with an initial learning rate of \(1{\times 10}^{-3}\), and the penalty parameter for the MSK residual term in the loss function was \(\beta \propto \frac{\Delta {t}^{2}}{I}={10}^{-3}\), where \(\Delta t\) is the time step between data points and \(I\) is the moment of inertia in Eq. (38). Five parameter initialization seeds were used to obtain an averaged response of the MR training.

To compare the post-training performance of the 1-, 2-, and 3-scale MR trainings, the average testing mean squared error (MSE) and testing \({\mathrm{R}}^{2}\) scores were compared, where these measures for a single trial are defined as:

$$\mathrm{MSE}=\frac{1}{n}{\Vert {\varvec{q}}-\widehat{{\varvec{q}}}\Vert }_{{L}_{2}}^{2}$$
(39)
$${\mathrm{R}}^{2}=1-\frac{{\sum }_{i=1}^{n}{\left({q}_{i}-{\widehat{q}}_{i}\right)}^{2}}{{\sum }_{i=1}^{n}{\left({q}_{i}-\overline{q }\right)}^{2}}$$
(40)

where \({\varvec{q}}\) is the motion data of the trial, \(\widehat{{\varvec{q}}}\) is the trial’s predicted motion from the MR PI-RNN framework, and \(\overline{q }\) is the mean of the trial’s motion data, with \(n\) being the number of data points in the trial. At each epoch of the MR training, the training loss is calculated using the data at the scale currently being trained, i.e., the scale \([-j]\) data are used during the scale \([-j]\) stage of training.
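For reference, Eqs. (39)-(40) translate directly into code:

```python
import numpy as np

def mse(q, q_hat):
    """Eq. (39): mean squared error over the n points of one trial."""
    return np.mean((q - q_hat) ** 2)

def r2_score(q, q_hat):
    """Eq. (40): coefficient of determination relative to the trial mean."""
    return 1.0 - np.sum((q - q_hat) ** 2) / np.sum((q - np.mean(q)) ** 2)
```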

The gradual improvement in these metrics is evident in Fig. 8: as further scales of information are added and the training data are augmented, the generalization performance improves from 1-scale to 3-scale training. Overall, the test MSE decreases and the \({\mathrm{R}}^{2}\) score approaches one, indicating an increase in generalization accuracy as more training scales are introduced. This can be explained through the theory of the bias-variance tradeoff: training on multiple scales of the data introduces more variance into the training, helping the ML framework reduce the bias it would develop by training only on the full scale of the data. Together, this reduction in bias and growth in variance leads to better generalization performance. Computationally, the method improves accuracy within the same number of training epochs, demonstrating its efficiency. As post-training generalization predictions are made using the full scale of the data, there is no increase in the time needed to perform the forward pass for any scale.

Fig. 8

The training loss and testing metrics are shown. The zoomed-in plots are included for clarity of the loss evolution over the last 500 epochs. The shaded area indicates one standard deviation from the mean (solid line) in the loss, average test MSE, and \({\mathrm{R}}^{2}\) score figures. As more scales of data are introduced in the MR training, the average test MSE and \({\mathrm{R}}^{2}\) score calculated post-training improve in each case

Meanwhile, the MSK parameters \({f}_{0}^{M}\) (maximum isometric force) and \({l}_{0}^{M}\) (optimal muscle length corresponding to the maximum isometric force) of both the biceps and the triceps were accurately identified from the motion data, as shown in Table 3. In our previous work [1], the maximum contraction velocity \({v}_{max}^{M}\) was identified independently in addition to \({f}_{0}^{M}\) because \({l}_{0}^{M}\) did not converge under the time-domain and feature-encoded trainings; in contrast, the proposed method accurately identifies \({l}_{0}^{M}\). \({v}_{max}^{M}\) can then be obtained from the experimentally observed relationship \({v}_{max}^{M}/{l}_{0}^{M}=10 \; {\mathrm {sec}}^{-1}\) [57, 66].

Table 3 The average percentage error (shown as mean \(\pm \) standard deviation) between predicted and true values of the parameters for 3-scale training for each case from five initialization points

For the identification of the optimal muscle length parameters \(\left({l}_{0}^{M}\right)\), the initial points need to be chosen with respect to the constraints imposed by the geometry of the MSK system. The errors reported in Table 3 are the averages of the percentage errors of the MSK parameters identified by the 3-scale training over the multiple parameter \(({\varvec{\theta}},\boldsymbol{\Gamma })\) initializations. Similar characterization accuracy was obtained from all training-scale approaches within each case, with errors of less than 1%. This indicates that the MR PI-RNN improves the generalization performance of the motion prediction without loss of parameter identification accuracy. Note that this example investigates the predictivity for in-distribution testing data, i.e., testing data that lie within the range of the training data; the effect of MR PI-RNN training on out-of-distribution predictivity is studied in “Appendix E”.

5 Validation: elbow flexion–extension motion

5.1 Application of MR PI-RNN to subject-specific data

The recorded motion data and sEMG signals were collected and processed following the data acquisition protocols in [1]. In brief, three elbow flexion–extension motion trials of 10 s each were performed by the subject, with two Delsys Trigno sEMG sensors placed on the biceps and triceps muscle groups according to SENIAM recommendations [67]. The processed sEMG signals were transformed as described in Sect. 2.1 to obtain the muscle activation signals used to calculate the MSK forward dynamics ODE residual. The same simplified rigid-body model as in Sect. 4 was used, with appropriately scaled anthropometric properties (for the geometry of the model) and physiological parameters (for the muscle–tendon material models of the muscle groups) based on the generic upper-body model defined in [68, 69]. Figure 9 shows the measured data of the three trials, including the transient raw sEMG signals and the corresponding angular motion of the elbow flexion–extension of the subject.

Fig. 9

The measured raw sEMG signals and the corresponding angular motion of the elbow flexion–extension of the subject are plotted

In this example, the raw sEMG signals were used as input. A 5-scale MR training procedure, as described in Sect. 4, was applied to a GRU with 1 hidden layer of 50 neurons. The data of trials 1 and 3 were used for training and trial 2 for testing, with each signal containing 500 temporal data points.

The muscle parameters to be identified by the framework are the maximum isometric forces and the optimal muscle lengths of both muscle groups, denoted as \(\boldsymbol{\Gamma }=\left\{{f}_{0,Bi}^{M},{l}_{0,Bi}^{M},{f}_{0,Tri}^{M},{l}_{0,Tri}^{M}\right\}\). It was observed in our tests that, despite the normalization process described in Eqs. (37) and (38), the parameters obtained at the end of the MR training with motion data either diverged or converged to non-physiological values. To obtain physiologically consistent parameters, we use values obtained from literature studies to constrain the parameter search space [44].

Let the parameter to be identified be defined as

$${\Gamma }_{l}\left({\varvec{\psi}}\right)=\frac{1}{N}{\sum }_{r=1}^{N}{\overline{\gamma }}_{r}\mathrm{sig}\left({\psi }_{r}\right),{\varvec{\psi}}=\left[{\psi }_{1},{\psi }_{2},\dots ,{\psi }_{N}\right]$$
(41)

where \({\overline{\gamma }}_{r}\) is the value reported in the \({r}^{th}\) literature study, \({\psi }_{r}\) is a trainable parameter passed through the sigmoid function \(\mathrm{sig}\left({\psi }_{r}\right)\), and \({\varvec{\psi}}\) is the vector of these trainable parameters. Using the optimized \({\varvec{\psi}}\), the desired MSK parameters can be estimated. This formulation constrains the identified parameters to be consistent with the parameters obtained through experimental studies [68,69,70].
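A sketch of Eq. (41), with hypothetical literature values and an assumed joint optimization with the network weights:

```python
import torch

def constrained_parameter(psi, gamma_bar):
    """Eq. (41): a physiologically bounded MSK parameter built from N
    literature values gamma_bar and trainable logits psi."""
    return torch.mean(gamma_bar * torch.sigmoid(psi))

# Assumed usage; the literature values below are placeholders:
gamma_bar = torch.tensor([600.0, 700.0, 800.0])   # hypothetical f0 values [N]
psi = torch.zeros(3, requires_grad=True)          # trainable, sig(0) = 0.5
f0_Bi = constrained_parameter(psi, gamma_bar)     # enters J_res via Eq. (42)
# optimizer = torch.optim.Adam([psi, *model.parameters()], lr=1e-3)
```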

The proposed framework was then applied to simultaneously learn the MSK forward dynamics surrogate and identify the MSK parameters \(\boldsymbol{\Gamma }\) by optimizing Eq. (16), where the residual of the governing equation at the current time step \(k\) is expressed, with a slight modification of Eq. (38) due to the change of parameter space, as

$$r\left({\widehat{q}}_{k}^{\left[-j\right]}\left({{\varvec{\theta}}}_{{\varvec{q}}}\right),{\dot{\widehat{q}}}_{k}^{\left[-j\right]}\left({{\varvec{\theta}}}_{{\varvec{q}}}\right),{\ddot{\widehat{q}}}_{k}^{\left[-j\right]}\left({{\varvec{\theta}}}_{{\varvec{q}}}\right);\boldsymbol{\Gamma }\left({\varvec{\psi}}\right)\right)=I{\ddot{\widehat{q}}}_{k}^{\left[-j\right]}\left({{\varvec{\theta}}}_{{\varvec{q}}}\right)-E\left({\widehat{q}}_{k}^{\left[-j\right]}\left({{\varvec{\theta}}}_{{\varvec{q}}}\right)\right)-{T}^{MT}\left({a}_{Bi}\left({t}_{k}\right),{a}_{Tri}\left({t}_{k}\right),{\widehat{q}}_{k}^{\left[-j\right]}\left({{\varvec{\theta}}}_{{\varvec{q}}}\right),{\dot{\widehat{q}}}_{k}^{\left[-j\right]}\left({{\varvec{\theta}}}_{{\varvec{q}}}\right);\boldsymbol{\Gamma }\left({\varvec{\psi}}\right)\right)$$
(42)

This residual is introduced into the term \({J}_{res}\) of the loss function, and the optimization problem becomes,

$$\widetilde{{\varvec{\theta}}},\widetilde{{\varvec{\psi}}}=\underset{{\varvec{\theta}},{\varvec{\psi}}}{\text{argmin}}\left({J}_{data}\left({\varvec{\theta}}\right)+\beta {J}_{res}\left({\varvec{\theta}},{\varvec{\psi}}\right)\right).$$
(43)

As in the verification example (Sect. 4), the multi-resolution parameter identification is performed starting from the coarsest scale, transferring the learned parameters to the next finer scale, and finally completing the parameter identification at the full scale, i.e., at scale \([0]\).

5.2 Results

To accelerate the training process, the training dataset was standardized to have zero mean and unit variance. The training was performed on the standardized data using the Adam algorithm [65] with an initial learning rate of \(1{\times 10}^{-3}\), and 4 history steps were considered. Five parameter-initialization seeds were used to obtain an averaged response of the MR training. To quantify the error in the testing predictions, a normalized mean squared error (NMSE) was defined,

$$\mathrm{NMSE}=\frac{1}{n}\frac{{\sum }_{i=1}^{n}{\left({q}_{i}-{\widehat{q}}_{i}\right)}^{2}}{{\sum }_{i=1}^{n}{\left({q}_{i}-\overline{q }\right)}^{2}}$$
(44)

where \({q}_{i}\) is the \({i}^{th}\) target motion data point, \({\widehat{q}}_{i}\) is the \({i}^{th}\) motion data point predicted by the MR PI-RNN framework, and \(\overline{q }\) is the mean of the target motion data. The \({\mathrm{R}}^{2}\) score was calculated using the metric defined in Eq. (40). As shown in Fig. 10, Fig. 11 and Table 4, adding training scales leads to improved motion predictions: the multi-resolution training increases the average test \({\mathrm{R}}^{2}\) score by more than 40% (bringing it closer to one), averaged over the initialization seeds, and Fig. 10 shows the progressive improvement of the predictions as more scales are included in the training.
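For reference, the two test metrics can be implemented as below with NumPy; Eq. (44) is reproduced including its \(1/n\) factor, and the \({\mathrm{R}}^{2}\) score follows the standard coefficient-of-determination form, which we assume matches Eq. (40).

```python
import numpy as np

def nmse(q, q_hat):
    """Normalized mean squared error, Eq. (44)."""
    n = len(q)
    return (1.0 / n) * np.sum((q - q_hat) ** 2) / np.sum((q - q.mean()) ** 2)

def r2_score(q, q_hat):
    """Coefficient of determination (assumed to match Eq. (40))."""
    return 1.0 - np.sum((q - q_hat) ** 2) / np.sum((q - q.mean()) ** 2)
```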

Fig. 10

Comparison of test predictions post-training for each MR training scale. The dashed line is the mean of the post-training predictions over the various initialization points used to begin the MR training, with the shaded region indicating one standard deviation from the mean

Fig. 11

The test normalized mean squared error (NMSE) and test \({\mathrm{R}}^{2}\) score plotted for the testing predictions post-training, averaged over five initialization seeds. The solid marker line shows the mean of each metric, with the shaded region indicating one standard deviation from the mean

Table 4 Test metrics (NMSE and \({\mathrm{R}}^{2}\) score) for the various training scales, averaged over five initialization seeds

The identified MSK parameters from the MR PI-RNN training are summarized in Table 5, where the means of the final converged values of \({f}_{0}^{M}\) and \({l}_{0}^{M}\), obtained from multiple parameter initializations at 4-scale training, are consistent with the physiological estimates of these parameters reported in the literature [68,69,70]. \({l}_{0,Bi}^{M}\) lies slightly outside the estimated range, which could be attributed to variability in the population. Similar values were obtained across all training scales; hence, the parameters from a representative 4-scale training are shown here. These results demonstrate the effectiveness of the proposed MR PI-RNN framework and its promising potential for real applications.

Table 5 The parameter estimates identified by the MR PI-RNN training and their values reported in the literature [68,69,70]

6 Discussion and conclusions

In this work, we proposed a multi-resolution physics-informed recurrent neural network (MR PI-RNN) for time-domain motion prediction and parameter identification in MSK systems. A GRU was trained with a physics-informed loss function that minimizes both the error on the training data and the residual of the MSK forward dynamics equilibrium. Wavelet-based multi-resolution techniques were used to decompose the input sEMG signals and the output joint motion data into coarse-scale approximations and fine-scale details at multiple scales. The multi-scale sEMG and joint motion components were then mapped to each other, starting from chosen coarse-scale components and sequentially training (via transfer learning) up to finer scales, completing the training on the full scale of the data.
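As an illustration of this decomposition step, the sketch below uses PyWavelets to build the coarse-scale approximations of a 1-D signal by zeroing the detail coefficients and reconstructing; the wavelet family (`db4` here) and boundary handling are assumptions and may differ from the paper's choices.

```python
# Coarse-scale approximations of a signal at successively finer scales.
import numpy as np
import pywt

def coarse_approximations(signal, n_scales, wavelet="db4"):
    signal = np.asarray(signal, dtype=float)
    approximations = []
    for scale in range(n_scales - 1, 0, -1):          # coarsest scale first
        coeffs = pywt.wavedec(signal, wavelet, level=scale)
        # Keep only the approximation coefficients; zero all details.
        coeffs = [coeffs[0]] + [np.zeros_like(d) for d in coeffs[1:]]
        approximations.append(pywt.waverec(coeffs, wavelet)[: len(signal)])
    approximations.append(signal)                     # scale [0]: full data
    return approximations
```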

By initializing training on the coarse scale of the training data, the optimization reaches a local minimum that serves as a better initialization state for the subsequent training stages that include the fine-scale details. The proposed transfer-learning-based sequential training scheme is well suited to datasets containing high-frequency signals, as shown in the verification example with synthetic mixed-frequency sEMG data. The numerical examples show improvements in both the testing predictions and the parameter identification. We observed from the loss profiles that the testing loss decreased, while the training loss increased, as more scales of data were brought in; the average test NMSE and \({\mathrm{R}}^{2}\) metrics also showed a clear improvement in generalization accuracy. These phenomena can be explained through the theory of the bias-variance tradeoff: training on various scales of the data introduces more variance into the training, helping the ML framework reduce the bias it would develop by training only on the full scale of the data. Computationally, the proposed method achieves this improved accuracy using the same number of training epochs.

The proposed MR framework was validated on recorded sEMG and motion data from a subject [1], and significant improvements were observed in the testing prediction accuracy, with 1-scale training often leading to large errors. The predicted motion at higher training scales showed improvements across all initialization points used, indicating the robustness of the method. The identified parameters were also consistent with the physiological ranges reported in the literature.

This method also has the advantage of operating in the time domain, in contrast to the feature-encoded (FE) training [1], where the input sEMG signals were projected onto the frequency domain using the Fourier basis. In the FE training, the input signal for the entire duration of the movement was needed to make a prediction, whereas the physics-informed MR training of the RNN enables the trained model to make real-time predictions using information from the previous time steps and the current sEMG signal. In addition, for mixed-frequency signals, the wavelet resolution better captures local frequency information than the Fourier basis, which captures global frequency information. Compared to the NN-based time-domain training performed at the original scale of the data (scale [0]) proposed in [1], the MR PI-RNN training approach described here achieved significant improvements owing to the stronger sequence-learning capability of the RNN and the effectiveness of the MR training.
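To illustrate what such real-time prediction looks like, the sketch below rolls a trained surrogate forward one step at a time; the input layout (a window of previous motion values concatenated with the current sEMG sample) is an assumption for illustration, and `model` is a stand-in for the trained GRU surrogate.

```python
# One-step-ahead autoregressive rollout in the time domain.
import torch

def rollout(model, semg, q_init):
    """semg: (T,) tensor of sEMG samples; q_init: (history,) initial motion."""
    q = [qi for qi in q_init]                      # running motion history
    for k in range(len(q_init), semg.shape[0]):
        history = torch.stack(q[-len(q_init):])    # previous time steps
        x = torch.cat([history, semg[k : k + 1]])  # plus current sEMG sample
        q.append(model(x.unsqueeze(0)).squeeze())  # predict next joint angle
    return torch.stack(q)
```

Because each step consumes only past predictions and the current measurement, no future portion of the sEMG signal is required, unlike the FE approach.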

This method is presented as a general approach in which multi-resolution is applied to both input and output. For some applications, e.g., those requiring only data mapping, the MR training can be applied by decomposing only the input, keeping the output at the full scale (i.e., scale [0]) throughout, or vice versa. To apply this method to clinical studies, the RNN hyperparameters may also be tuned to account for subject variability. The dependence of this method on the number of data points available in a signal can also be studied in the future. To further improve computational efficiency, multi-resolution functions could be used as activation functions of the ML framework, instead of relying on data filtration processes. This method will also be studied with other physics-informed ML techniques to solve forward problems with PDEs having mixed-frequency source terms.