Introduction

Muscle strength is a central component of function [1,2,3,4]. Deterioration in muscle strength below critical thresholds can have a significant impact on an individual's ability to accomplish activities of daily living [2, 3] and locomotion [5,6,7]. Physiotherapists need to adequately measure the magnitude of muscle weaknesses, as they will guide the clinical management of a given condition [8].

Different methods exist to measure maximal isometric muscle strength (MIMS), but present characteristics limiting their usefulness in clinical decision-making. The isokinetic dynamometer, for example, is the gold standard for measuring muscle strength [9], but it is costly and requires a large space and considerable user training, limiting its clinical accessibility. Manual muscle testing (MMT) is easy and quick to perform, and does not require any equipment [10], but presents poor psychometric properties [10]. Indeed, MMT lacks sensitivity to identify changes in muscle strength over time [11, 12]. Quantitative muscle testing (QMT) using a handheld dynamometer (HHD) is a promising alternative for muscle strength assessment. HHD is simple, affordable, accessible for clinicians, and more accurately detects muscle weakness than MMT [11,12,13]. QMT has good to excellent psychometric properties for different muscle groups evaluated in various populations [9, 14,15,16]. Indeed, MIMS values obtained with HHD show good concurrent validity with isokinetic dynamometry [9, 17, 18] and good to excellent reliability for most muscle groups [14, 19,20,21,22,23,24]. To be confident that muscle strength changes are true changes rather than the result of measurement error, clinicians should ensure that the measurement error of the chosen outcome measure is small [25]. This can be assessed using measurement error parameters such as the standard error of measurement (SEM), limits of agreement (LOA), and minimal detectable change (MDC) [25].

Previous studies have shown good to excellent intra- and inter-rater reliability of HHD muscle strength measurements for varying numbers of muscle groups, except for the ankle muscle groups, which showed moderate intra- and inter-rater reliability [14,15,16, 19,20,21, 26, 27]. However, none has assessed the intra- and inter-rater reliability of a standardized HHD protocol for the assessment of muscle torque for multiple key muscle groups of the upper and lower limbs essential to achieving daily activities. Moreover, the protocols and the types of devices used in these studies have several limitations that discourage their use in research and clinical settings, including overlooking the effect of gravity, not measuring the lever arm, a lack of joint stabilization, especially for strong muscle groups, and a lack of device stability due to the poor ergonomics of the HHD used [28].

The objectives of this study were to determine the intra- and inter-rater reliability, agreement, SEM, and MDC of the muscle strength torque values of 17 muscle groups of the upper and lower extremities in healthy adults, obtained with a standardized protocol using a push–pull HHD. Based on the results obtained by Hébert et al. [29], our hypothesis is that intra- and inter-rater reliability will be good to excellent for all muscle groups tested.

Methods

Participants

A convenience sample of 30 healthy adults aged between 18 and 70 years was used for this study. Based on data obtained in a previous intra-rater reliability study for knee extensors assessment using the same protocol (ICC = 0.98) [30] and according to the review of Bujang and Baharum [31], the sample size was determined using an 80% power, α = 0.05, minimum acceptable reliability of 0.5, and an expected good to excellent reliability > 0.75. Participants were recruited through advertisements in newspapers, social networks, contact lists of different employers, and posters placed in public areas. Participants were included if they were available to take part in the protocol spanning half a day. They were excluded if they presented any of the following criteria: 1) participation in sport at a competitive level; 2) degenerative or neuromusculoskeletal disease that could affect torque measurements; 3) traumatic experience or disease in the previous years that could affect their muscle capacity and strength; and 4) use of medication that could impact muscle strength (e.g., muscle relaxants, analgesics, opioids) at the time of the evaluation. Written informed consent was obtained from each participant prior to the first assessment, and the study was approved by the Ethics Committee of the Integrated University Center of Health and Social Services (CIUSSS) of the Capitale-Nationale.

Instrumentation

The MEDup™ HHD (Atlas Medic, Québec, Canada) was used in either compression or distraction mode depending on the muscle group evaluated. The dynamometer was set to read muscle strength values in Newtons. The calibration of the dynamometer was verified with reference weights at baseline and every 3 months to ensure validity and good measurement accuracy.

The measurements were performed by two independent raters who had received 3 full days of training on the standardized operating procedure and the HHD protocol. The training was followed by approximately 20 h of practice. The first evaluator (E1) was a 31-year-old female physiotherapist who worked at the CIUSSS of Saguenay–Lac-St-Jean, with 4 years of clinical experience in geriatrics, and no experience using HHD. She was 5′10″ in height and weighed 63.6 kg. The second evaluator (E2) was a 23-year-old female physiotherapy technologist who worked in a private clinic, with one year of clinical experience, and no experience using HHD. She was 5′5″ in height and weighed 85 kg.

Study protocol

Data collection for this cross-sectional study was conducted from January 2021 to October 2021. MIMS torque of 17 muscle groups of the upper (shoulder abductors, internal and external rotators, and flexors; elbow and wrist flexors and extensors) and lower (hip abductors, internal and external rotators, flexors and extensors; knee flexors and extensors and ankle dorsiflexors and evertors) extremities was measured using a standardized HHD protocol inspired by a protocol previously published by Hébert et al. [29]. The current protocol is described in detail for each muscle group (subject’s and evaluator’s position, stabilization, adapter type and dynamometer placement and lever arm measurement) in the supplementary materials (see Additional files 1, 2 and 3). As shown in Fig. 1, measurements were taken during three different sessions (S1, S2 and S3) by two independent evaluators. MIMS torque of the right or left side of the 17 muscle groups was assessed during an initial evaluation session (S1) by the first evaluator (E1). Five days later, MIMS torque of the same side was assessed in a second session (S2) by the second evaluator (E2) to assess the inter-rater reliability. Finally, nine days later, the MIMS torque of the same side was measured in a third session (S3) by the first evaluator (E1) to assess the intra-rater reliability. The order in which muscle groups were assessed for each participant was determined during the first session using block randomization of the upper and lower extremities and muscle groups to control for learning effect and potential fatigue. This order was subsequently reproduced for each session. The side (right or left) being evaluated was alternated between consecutive participants.

Fig. 1

Study protocol. Torque of 17 muscle groups was assessed by two independent raters at three different times (S1, S2, S3) over a 14-day period (n = 30 participants). Intra- and inter-rater reliability were determined by comparing the torque values obtained at S1 and S3, and at S1 and S2, using intraclass correlation coefficients (ICC(3,k) and ICC(2,k), respectively)

Assessment protocol

The following guiding principles were systematically applied for each muscle group tested: a. to control for the effect of gravity, each testing position was chosen to eliminate the effect of the evaluated segment's weight; b. the body of the dynamometer was aligned with the plane of movement and was perfectly perpendicular to the segment in order to register 100% of the force vector produced by the evaluated muscle group; c. to control for compensations, non-slip surfaces and rigid straps were used to stabilize and/or to perform closed chain evaluations, thus eliminating the effect of the evaluator; d. easy-to-palpate anatomical landmarks were chosen in order to accurately and reproducibly measure the lever arms, and; e. a comprehensive standardized training session for the evaluators that was long enough to allow them to integrate all these principles for each of the 17 muscle groups evaluated was provided. In each evaluation session, the limb was first placed in the testing position by the evaluator and a submaximal contraction of about 50% of the maximal effort was performed before each trial to ensure that the isometric contraction was well understood and executed, and that the stabilization of the segment was adequate. Then, the participant was asked to produce a maximal contraction by gradually pushing against the HHD (or by pulling the strap for the distraction mode), steadily increasing to their maximal effort, and maintaining the maximal effort until they were told to release. Contractions lasted for ten seconds. The following standardized verbal encouragement was given throughout the effort to ensure that the peak force was reached: “Go ahead, push, harder, push, go ahead, as hard as you can”. The intensity and tone of voice of the encouragements were gradually increased over the course of the 10-s contraction. 
Three trials were performed using isometric “make” tests, meaning that the evaluator held the HHD still while the participant exerted a maximal force against it. The coefficient of variation between trials was calculated, and when it exceeded ten percent, additional trials were performed until three measures within ten percent of variation were obtained, up to a maximum of five measures. The three closest trials were kept for the final analyses. A minimum rest period of 30 s was allowed between trials. If needed, an additional rest period was allowed to ensure that maximum strength was achieved for each trial and each muscle group. The lever arm was measured for each muscle group on each side, as described in the standard operating procedure in Additional file 1 of the supplementary material, to convert the MIMS obtained in Newtons into Newton-meter torque values. When required, rigid straps were used to: a. resist the contraction, b. insert the HHD between the segment and the strap (hip extensors, knee extensors), c. stabilize the segment to avoid compensations (wrist flexors and extensors, hip abductors, ankle evertors), or d. perform the evaluation in distraction mode (hip flexors, hip abductors, knee flexors). Pain was assessed with a visual analogue scale, and when pain prevented the participant from reaching their maximal effort, the test was not repeated, and data were excluded from the final analysis. Evaluators made sure to correct any compensations that occurred (e.g., correcting body alignment, ensuring that the starting position was maintained and that the stabilization was used only to stabilize and not to produce force). At the first assessment session, data such as age, gender, height, weight, and body mass index were also documented.
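The trial-selection rule and the force-to-torque conversion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the study: the function names are hypothetical, the coefficient of variation is taken as the sample SD divided by the mean, and "the three closest trials" is interpreted as the subset of three trials with the lowest coefficient of variation. The numerical values are invented for the example.

```python
from itertools import combinations
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Coefficient of variation in percent (assumption: sample SD / mean)."""
    return stdev(values) / mean(values) * 100.0


def three_closest_trials(forces_n):
    """Return the three trials with the lowest coefficient of variation.

    Assumption: 'closest' is read as the 3-trial subset minimizing the CV.
    """
    if len(forces_n) < 3:
        raise ValueError("at least three trials are required")
    return min(combinations(forces_n, 3), key=coefficient_of_variation)


def mean_torque_nm(forces_n, lever_arm_m):
    """Convert the retained forces (N) to torque (Nm) and average them."""
    kept = three_closest_trials(forces_n)
    return mean(force * lever_arm_m for force in kept)


# Hypothetical trial values for one muscle group (not study data).
trials_n = [200.0, 150.0, 205.0, 198.0]
kept = three_closest_trials(trials_n)

# A further trial would be requested if the CV of the kept trials exceeded 10%.
needs_more_trials = coefficient_of_variation(kept) > 10.0

# Hypothetical lever arm of 0.30 m.
torque = mean_torque_nm(trials_n, lever_arm_m=0.30)
```

In this example the outlying 150 N trial is discarded, and the remaining three trials fall well within the ten percent threshold.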

Statistical analysis

The mean of the three torque values (obtained by multiplying the strength values [Newton] by the lever arm [meter]) of each side was calculated for all muscle groups for each participant. Descriptive statistics (mean and standard deviation [SD]) of these means were calculated. Normality of the MIMS distribution for each muscle group was analyzed using Shapiro–Wilk tests. Descriptive statistics (mean, SD, frequency, and percentage) of participant characteristics were also calculated. Intra- and inter-rater reliability were calculated using intraclass correlation coefficients (ICC) with 95% confidence intervals (CI). Intra-rater reliability was calculated by comparing measurements taken by the same rater (E1) fourteen days apart (S1 and S3), using multiple measurements in a two-way mixed-effects model with absolute agreement. Inter-rater reliability was calculated by comparing the torque values obtained by two different raters (E1 and E2) five days apart (S1 and S2), using multiple measurements in a two-way random-effects model with absolute agreement. ICC values were interpreted according to Koo and Li (2016), who proposed that values greater than 0.90, between 0.75 and 0.90, between 0.50 and 0.75, and less than 0.50 indicate excellent, good, moderate, and poor reliability, respectively [32]. Bland and Altman (BA) plots were also used to evaluate the agreement between the measurements taken at different sessions. One-sample t-tests of the difference of scores obtained between measurement time-points were used to identify significant systematic bias and to provide all the relevant data to calculate the limits of agreement and to draw BA plots. The SEM was calculated using the following formula: SEM = SDpooled*√(1-ICC), where SDpooled is the average of the SD calculated from the 6 trials (3 trials in each session) for each participant [25]. MDC was also calculated with a 95% CI using the formula MDC = 1.96*SEM*√2, where 1.96 is derived from the 95% CI [25].
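As a worked illustration of the SEM and MDC formulas given above, the following Python sketch applies them to invented values. The function names and the numbers are hypothetical, and expressing the relative values as a percentage of the mean torque is an assumption made for this example.

```python
from math import sqrt
from statistics import mean, stdev


def pooled_sd(six_trials_per_participant):
    """Average, across participants, of the SD of each participant's six trials."""
    return mean(stdev(trials) for trials in six_trials_per_participant)


def sem(sd_pooled, icc):
    """Standard error of measurement: SDpooled * sqrt(1 - ICC)."""
    return sd_pooled * sqrt(1.0 - icc)


def mdc95(sem_value):
    """Minimal detectable change at the 95% level: 1.96 * SEM * sqrt(2)."""
    return 1.96 * sem_value * sqrt(2.0)


# Hypothetical inputs (not study data): pooled SD of 10 Nm, ICC of 0.96,
# and a mean torque of 100 Nm used to express the results in relative terms.
sem_nm = sem(sd_pooled=10.0, icc=0.96)   # about 2.0 Nm
mdc_nm = mdc95(sem_nm)                   # about 5.54 Nm
sem_pct = sem_nm / 100.0 * 100.0         # assumption: percent of mean torque
mdc_pct = mdc_nm / 100.0 * 100.0
```

With these invented inputs, a change in torque smaller than roughly 5.5 Nm between two sessions could not be distinguished from measurement error.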
Pairwise deletion was applied in the presence of missing data. Significance was set at α = 0.05 and all statistical analyses were performed using SPSS (IBM SPSS Statistics 28.0 for Windows, Armonk, NY, USA).
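The reliability and agreement statistics described in this section were obtained with SPSS. For readers who wish to follow the arithmetic, the sketch below reproduces the main quantities in plain Python under stated assumptions: the average-measures, absolute-agreement ICC is computed from two-way ANOVA mean squares as (MSR - MSE) / (MSR + (MSC - MSE)/n), a point estimate that is shared by the two-way random and two-way mixed models, and the Bland and Altman quantities are the mean difference, the limits of agreement (mean difference ± 1.96 SD), and the one-sample t statistic of the differences. Function names and data are hypothetical, and confidence intervals and p-values are left to a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def icc_absolute_agreement_avg(session_a, session_b):
    """Average-measures, absolute-agreement ICC for two sessions."""
    rows = list(zip(session_a, session_b))
    n, k = len(rows), 2
    grand = mean(value for row in rows for value in row)
    row_means = [mean(row) for row in rows]
    col_means = [mean(session_a), mean(session_b)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((value - grand) ** 2 for row in rows for value in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                 # between-participant mean square
    ms_cols = ss_cols / (k - 1)                 # between-session mean square
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)


def bland_altman(session_a, session_b):
    """Return bias, lower and upper limits of agreement, and t statistic.

    The t statistic is undefined when all differences are identical (SD = 0).
    """
    diffs = [a - b for a, b in zip(session_a, session_b)]
    bias = mean(diffs)
    sd = stdev(diffs)
    t_stat = bias / (sd / sqrt(len(diffs)))
    return bias, bias - 1.96 * sd, bias + 1.96 * sd, t_stat


# Hypothetical torque values in Nm for four participants (not study data).
s1 = [10.0, 20.0, 30.0, 40.0]
s2 = [11.0, 19.0, 32.0, 38.0]
icc = icc_absolute_agreement_avg(s1, s2)
bias, loa_low, loa_high, t_stat = bland_altman(s1, s2)
```

Because the absolute-agreement form keeps the between-session mean square in the denominator, a constant offset between raters lowers the ICC, which a consistency-type coefficient would not capture.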

Results

Participants

Fifteen women and seventeen men took part in this study. Two women dropped out after the first assessment session for personal reasons, leaving thirty participants who completed all three sessions. Participant characteristics are shown in Table 1. A minimum of 28 participants completed the three sessions for each muscle group (see Table 2). Three participants were unable to produce a maximal contraction for certain muscle groups due to pain or discomfort in specific joints (shoulder abductors and wrist flexors [n = 1], shoulder external rotators [n = 1], hip abductors and extensors [n = 1]). In addition, we were unable to assess a full maximal contraction of the hip flexors, internal and external rotators, and knee flexors of two participants due to a transient technical problem with the HHD in their second and third evaluation sessions. Finally, one participant's shoulder internal rotator strength could not be measured according to protocol due to the size of their abdomen.

Table 1 Participants’ characteristics
Table 2 Intra- and inter-rater reliability, standard error of measurement and minimal detectable change

Intra- and inter-rater reliability

Table 2 summarizes the descriptive statistics (mean and standard deviation) of MIMS torques, intra- and inter-rater reliability, SEM, and MDC values for all muscle groups.

Regarding the intra-rater reliability, the obtained ICC values (95% CI) for all muscle groups ranged from 0.902 (0.789–0.954) to 0.990 (0.978–0.995), indicating excellent intra-rater reliability for most of the muscle groups, except for the wrist flexors and extensors and the hip flexors, which showed good to excellent reliability. Absolute and relative SEM and MDC ranged from 0.14 Nm to 3.20 Nm and 0.5% to 2.84% for the SEM, and 0.38 Nm to 8.87 Nm and 1.38% to 7.88% for the MDC, respectively, for all muscle groups (see Table 2). Table 3 shows the t-values and corresponding p-values obtained using one-sample t-tests of the differences between the measurement time-points S1-S3 and S1-S2. Only the graphs of the muscle groups that showed a systematic bias between the two measurement time-points (S1-S3 for intra-rater reliability and S1-S2 for inter-rater reliability) are presented. Other graphs can be consulted in the supplementary material (Additional files 4 and 5). As shown in Table 3 and Fig. 2, the absolute and relative mean differences between Sessions 1 and 3 varied from 0.01 Nm to 7.4 Nm and from 0.04% to 5.6%, respectively. Only four out of 17 muscle groups (shoulder flexors, elbow extensors, hip internal rotators, ankle evertors) showed a significant systematic bias.

Table 3 Intra- and inter-rater agreement according to Bland and Altman plots and limits of agreement
Fig. 2

Bland and Altman plots, intra-rater assessment. Bland and Altman plots showing significant systematic bias of the mean difference of muscle torque in Nm between the first and third sessions of the shoulder flexors (A), elbow extensors (B), hip internal rotators (C) and ankle evertors (D). Limits of agreement (LOA) are identified by the dotted lines, from -1.96SD to + 1.96SD, and the mean difference by the red line. The mean difference confidence intervals are depicted by the shaded area

Regarding the inter-rater reliability, the obtained ICC values (95% CI) ranged from 0.888 (0.731–0.950) to 0.989 (0.978–0.995) indicating good to excellent reliability for the majority (15/17) of the muscle groups tested by two different raters. Only the wrist flexors and the hip internal rotators showed moderate to excellent inter-rater reliability. Absolute and relative SEM and MDC ranged from 0.17 Nm to 5.80 Nm and 0.49% to 3.25% for the SEM, and 0.47 Nm to 16.06 Nm and 1.35% to 9.02% for the MDC, respectively. Regarding Table 3 and the BA plots (Fig. 3), the absolute values of the mean of the difference between Sessions 1 and 2 all varied from 0.02 Nm to 8.5 Nm, except for the hip extensors, which showed a mean difference of -17.8 Nm. In relative values, the mean difference for all muscle groups varied from 0.3% to 12.6% of the MIMS torque values. Eight out of 17 muscle groups showed significant systematic bias according to BA plots (see Fig. 3). Other graphs can be consulted in the supplementary material (Additional files 6 and 7).

Fig. 3

Bland and Altman plots, inter-rater assessment. Bland and Altman plots showing significant systematic bias of the mean difference of muscle torque in Nm between the first (S1) and second sessions (S2) of the shoulder flexors (A), wrist flexors (B), hip internal rotators (C) and external rotators (D), hip flexors (E) and extensors (F), knee flexors (G), and ankle evertors (H). Limits of agreement (LOA) are identified by the dotted lines, from -1.96SD to + 1.96SD and the mean difference by the full line in bold. The mean difference confidence intervals are depicted by the shaded area

Discussion

The intra- and inter-rater reliability and agreement of a standardized HHD protocol for most of the muscle groups of the lower and upper limbs (n = 17) were documented in this study. The results demonstrate good to excellent intra- and inter-rater reliability of the protocol for almost all the muscle groups tested. To our knowledge, this is the first study to assess the intra- and inter-rater reliability of an HHD protocol for so many muscle groups. Moreover, the protocol used was rigorous and respected a series of biomechanical guiding principles of muscle strength assessment that allowed us to control for many potential sources of error.

Despite our unique protocol, our results are consistent with those of certain other studies, which showed good to excellent intra- and inter-rater reliability for several muscle groups [14, 15, 19,20,21,22, 24, 26, 33,34,35]. However, reliability values were higher for some muscle groups, such as the ankle dorsiflexors which showed poor to good intra- and inter-rater reliability in a few other studies using HHD [14, 15, 20, 24, 26, 36]. Muscle strength assessment of the ankle dorsiflexors is challenging for a few reasons, notably: there is a short lever arm resulting in poor mechanical advantage for the evaluator, and the inclined surface of the foot in the starting position of the test makes it more difficult to position the HHD perpendicularly to the segment. The observed difference in our study could be explained in large part by the type of device used and the position of the evaluator’s wrist. Most previous studies used a MicroFET or Lafayette HHD, which are both push dynamometers and quite different from the MEDup™ used in the present study [14, 15, 20, 24, 26]. The design of the MEDup™ offers a mechanical advantage; its pistol grip (inferior handle) and bilateral handles allow a neutral wrist position and enable the evaluator to resist the participant’s force with both hands, creating better stability across muscle groups.

Concerning the wrist flexors and hip internal rotators that showed lower inter-rater ICC values, we hypothesize that more compensations (internal shoulder rotators for the wrist flexors and hip abduction for the hip internal rotators) could have occurred for these two muscle groups, potentially causing greater discrepancy between the results obtained by the two independent evaluators. Another hypothesis for the wrist flexors is that error may have been introduced using the half-sphere adaptor of the HHD, which inhibits positioning of the dynamometer support in the same place at each trial, contrary to the HHD adaptors used for all the other muscle groups. The reliability of wrist flexor HHD muscle strength assessment was only evaluated in one other study, which reported ICC values of 0.86 in healthy adults [36]. However, considering the missing data (no 95% CI provided) and the use of a different protocol in Kilmer’s study, comparisons with our results are not possible [36]. As for the reliability of the hip internal rotators, a few studies have been conducted with variable results [23, 26, 37]. Unlike our results, Gonzalez-Rosalen et al. [23] showed excellent inter-rater reliability. In contrast, Thorborg et al. [37] revealed similar results to ours, with fair to excellent inter-rater reliability and no agreement between testers. However, the measurements in these studies were taken in the prone position instead of the seated position as in our protocol, which again limits comparisons. In our experience, assessing the hip rotators in the prone position increases possible compensations in the frontal plane, such as hip abduction and adduction, and it is also more difficult to keep the leg stable at 90° of knee flexion.

The results showed small measurement errors for the 17 muscle groups, with SEM and MDC all below 4% and 10% respectively in relative values for intra- and inter-rater assessments. According to the literature, a SEM of less than 10% is clinically acceptable [38]. Although Gonzalez-Rosalen et al. [23] reported good SEM values for 15 muscle groups, their use of Newtons rather than Newton-meters prevents comparisons with other studies, including ours. Also, these SEM values do not consider the error associated with measuring the lever arm, which is key to the biomechanics of strength assessment. Few studies have used the Newton-meter as a unit of force measurement, limiting comparisons to those that have. When comparing the results obtained in relative values, our results showed smaller SEM and MDC. For example, Buckinx et al. [15] showed large measurement error with relative SEM values varying from 26.56% to 101.1% for intra-observer and 17.11% to 115.29% for inter-observer. Mentiplay et al. [16], who evaluated intra- and inter-rater reliability of HHD for the assessment of isometric lower limb muscle strength found SEM varying from 5.29% to 10.81% and 4.54% to 12.53%, respectively. Altogether, studies that calculated MDC reported values greater than 10% for all muscle groups tested [15, 19, 39, 40] even if they only measured muscle strength values rather than torque values. By adding the lever arm measurement, one could expect the MDCs to be even higher considering that it adds another source of measurement error. These results highlight the excellent psychometric properties of our standardized HHD protocol.

Lastly, intra- and inter-rater agreement was determined using BA plots to improve clinical interpretation of the agreement between the sets of measures and to validate the level of agreement quantified by the ICC [41]. Despite the high ICC values obtained for all muscle groups, no agreement was found between the measurements of four and eight muscle groups in intra- and inter-rater assessment, respectively, which shows systematic biases between sessions and/or between testers. For the inter-rater assessment, a positive significant bias between testers was observed for a few specific muscle groups (wrist flexors, hip internal and external rotators and flexors, knee flexors, ankle evertors), meaning that E1 overestimated values compared to E2. The opposite was observed for shoulder flexors and hip extensors. Among the factors that could cause these biases, anthropometric characteristics and physical capacities of the raters could explain the perceived difference for certain muscle groups requiring greater ability to resist due to their greater strength, such as the shoulder flexors and the hip and knee flexors. Indeed, Gonzalez-Rosalen et al. [23], who compared pull and push dynamometry, found that pull dynamometry had better agreement between testers than push dynamometry, especially for stronger muscle groups due to the reduction of the examiner’s strength interaction in pull dynamometry. Also, some studies revealed significant systematic biases between raters that could be due to their capacity to resist stronger muscle groups [26, 27, 37]. However, in contrast to these studies, it is impossible to affirm that one evaluator rated systematically lower than the other. An analysis of our BA plots shows an increase in the magnitude of the mean difference with increasing mean torque values, more specifically for the wrist, hip and knee flexors in inter-rater assessment, as seen in Fig. 3.
This increase could be related to the smaller rater’s ability to resist greater levels of strength. Nevertheless, evaluator characteristics alone cannot explain all the differences. For some muscle groups, the role of the evaluator is less important and even zero (when assessed in a closed chain like for the knee extensors) and the assessment quality mainly relies on the positioning and stabilization of the HHD, as for the hip internal rotators and the hip extensors. Yet, these muscle groups show the greatest bias. Many other factors may come into play, such as positioning, participant compensations, and verbal stimulation. However, the standardized operating procedure should minimize such variability. These results demonstrate that this HHD protocol could still benefit from revisions to improve agreement between data, but the results obtained are much better than those of other studies [14, 15, 24, 26, 36]. This can be explained by the rigorous and novel approach of this study's protocol which is based on basic biomechanical concepts that do not seem to have been mentioned in the literature to date. The strict adherence to these guiding principles helps to control for errors associated with the handling of the HHD during testing and the data collection procedure. Consequently, the assessment of muscle strength with HHD allows reliable measurements even with inexperienced evaluators who have been appropriately trained.

This study presents limitations. Although criterion validity of this standard operating procedure has been assessed in a pediatric population, it has not yet been assessed in the adult population. It would have been appropriate to do this in conjunction with the assessment of intra- and inter-rater reliability, but this would have required many additional resources and it was not the primary objective of our study. However, this step could be done in a future research project. The study sample size prevented analysis of the results by age categories and by sex. Such analysis would have facilitated use of the reference values established from our protocol. Since the measurements were taken in healthy adults with a well-defined procedure, the findings of this study cannot be generalized to other populations or types of protocols using different devices and/or different positioning.

Conclusion

Considering the excellent intra- and inter-rater reliability and the small error of measurement of the standardized HHD protocol for 17 muscle groups, the HHD protocol is a method of choice for MIMS torque measurements in clinical and research settings. Knowing the psychometric properties of MIMS torque values obtained with this HHD standardized measurement protocol will allow optimal use of the upcoming reference values.