Introduction

Up to 80% of adults experience chronic non-specific low back pain (LBP) at some point in their lives [1]. According to the Global Burden of Disease study, LBP is the leading cause of disability [2]. Despite considerable research on the etiology of LBP, the pathomorphological relationships are not yet fully understood. One factor that might affect LBP is the composition and morphology of the paraspinal muscles [3, 4].

Currently, both quantitative and qualitative methods are used to assess the composition of paraspinal muscle tissue. Qualitative assessment relies on visual grading of the extent of fatty infiltration. The reliability of the Goutallier classification system (0–4 grading scale) for grading fatty infiltration in the lumbar region has been questioned previously [5]. Therefore, quantitative measurements such as the cross-sectional area (CSA) and the functional cross-sectional area (FCSA), which is obtained by excluding the fat component from the muscle's cross-sectional area, have gained growing interest [6,7,8,9,10]. A quantitative assessment of paraspinal muscle composition using MRI is performed by segregating the pixels within the region of interest that are thought to represent fat. This can be done using manual segmentation or different threshold methods [11, 12].

Although it is assumed that measurement error is mainly related to the observer and the method used, another problem might be the use of different software, which can lead to incomparable results [13]. Therefore, it is important to verify whether direct comparisons can be made between the various freeware and commercial software packages used for this procedure. Whereas Fortin et al. [14] reported excellent agreement between ImageJ and OsiriX in the assessment of paraspinal muscle CSA, composition, and side-to-side asymmetry, the amount of measurement error between the U.S. Food and Drug Administration certified software package Amira (version 2019.4, Thermo Fisher Scientific Inc., Waltham, USA) and the freeware ImageJ (version 1.53, National Institutes of Health, Bethesda, Maryland, USA) is unknown. Therefore, the objective of this study was to elucidate the influence of inter-software differences between Amira and ImageJ as well as the influence of the segmentation technique.

Materials and methods

Study design

In this retrospective study, we randomly selected 60 MRIs (39 women and 21 men) of the lumbar region from a sample of a large cohort study, which was approved by the local ethics committee (EA1/058/21). MRI scans were acquired on a Siemens Avanto 1.5 T MRI system (Siemens AG, Erlangen, Germany) with T2-weighted turbo spin echo sequences for both axial and sagittal images. The axial T2 parameters were a repetition time of 4000 ms, an echo time of 113 ms, and a slice thickness of 3 mm. As the vast majority of degenerative changes are found in the lower spine, the levels L4–L5 and L5–S1 were evaluated.

Muscle measurements and segmentation

All measurements were performed by two orthopedic residents who were trained in MRI muscle assessment. The MR images were analyzed with two different image processing programs (ImageJ and Amira). Both observers measured the MRIs in random order. The CSA of the multifidus muscle (MF) and erector spinae muscle (ES) was measured at mid-disk level at L4/5 and L5/S1 (Fig. 1); the CSA was measured once, before any threshold was applied. FCSA and FCSA/CSA were determined using two different segmentation thresholds for differentiating muscle fibers from fatty muscle infiltration.

Fig. 1

L5/S1 MRI of the same subject; A and B were processed with ImageJ, C and D with Amira

Circle method: Six sample regions of interest (ROIs) were placed within the MF and ES in the visible areas of muscle tissue with the least visible fatty infiltration. The maximum signal intensity obtained from the sample ROIs was taken as the upper threshold to distinguish between muscle tissue and fat. Since the lower limit is usually 0 or close to 0, it was uniformly set to 0 to minimize errors (defined as Circle method) [14].
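
The thresholding step of the Circle method described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure as implemented in ImageJ or Amira: the slice is assumed to be available as a 2D NumPy array, the muscle outline and the sample ROIs are assumed to be boolean masks, and the function name `circle_method_fcsa` and the parameter `pixel_area_mm2` are hypothetical.

```python
import numpy as np

def circle_method_fcsa(image, muscle_mask, sample_rois, pixel_area_mm2=1.0):
    """Sketch of the Circle method (hypothetical helper, not an ImageJ/Amira API).

    image          : 2D array of signal intensities
    muscle_mask    : boolean 2D array outlining the muscle CSA
    sample_rois    : list of boolean 2D arrays placed in lean muscle tissue
    pixel_area_mm2 : area of one pixel (assumed known from the scan geometry)
    """
    # Upper threshold: highest intensity found in any lean-muscle sample ROI.
    upper = max(image[roi].max() for roi in sample_rois)
    lower = 0  # lower limit fixed at 0, as described in the text

    muscle_pixels = image[muscle_mask]
    # Pixels inside [lower, upper] count as muscle, everything brighter as fat.
    lean = (muscle_pixels >= lower) & (muscle_pixels <= upper)

    csa = muscle_pixels.size * pixel_area_mm2
    fcsa = lean.sum() * pixel_area_mm2
    return csa, fcsa, fcsa / csa
```

The function returns the CSA, the FCSA, and the FCSA/CSA ratio, so the same outline can be reused when a different threshold is applied.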

Overlap method: The CSA of the paraspinal muscles (ES and MF) and a region of subcutaneous fat (SC) were outlined on both sides. By presenting the grayscale ranges of the CSA and the SC as histograms and overlaying them, it was possible to identify the signal intensities common to both regions. The overlapping area of the histograms represents the intensity range of the fat signal within the CSA (defined as Overlap method) [15].
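
One possible reading of the Overlap method is sketched below. It is an assumption-laden illustration rather than the authors' implementation: "overlap" is taken to mean histogram bins that are populated in both the muscle and the subcutaneous fat region, the number of bins `n_bins` is an arbitrary choice, and the function name `overlap_method_fcsa` and the mask inputs are hypothetical.

```python
import numpy as np

def overlap_method_fcsa(image, muscle_mask, fat_mask, pixel_area_mm2=1.0, n_bins=256):
    """Sketch of the Overlap method (hypothetical helper, not an ImageJ/Amira API)."""
    muscle = image[muscle_mask]
    fat = image[fat_mask]

    # Shared bin edges so that the two histograms can be overlaid.
    edges = np.histogram_bin_edges(image, bins=n_bins)
    muscle_hist, _ = np.histogram(muscle, bins=edges)
    fat_hist, _ = np.histogram(fat, bins=edges)

    # Assumption: bins populated in both histograms form the overlapping area.
    overlap_bins = (muscle_hist > 0) & (fat_hist > 0)

    # Map every muscle pixel to its bin; pixels in overlapping bins count as fat.
    bin_idx = np.clip(np.digitize(muscle, edges) - 1, 0, n_bins - 1)
    is_fat = overlap_bins[bin_idx]

    csa = muscle.size * pixel_area_mm2
    fcsa = (muscle.size - is_fat.sum()) * pixel_area_mm2
    return csa, fcsa, fcsa / csa
```

Because the fat range here is derived from a reference fat region rather than from lean-muscle samples, the resulting upper threshold generally differs from the one obtained with the Circle method.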

Data analysis

For each measurement, descriptive statistics such as means \((\overline{x})\) and standard deviations (SD) were calculated. The inter-rater, inter-software, and inter-threshold reliability of the measurements was evaluated using the intra-class correlation coefficient (ICC). Agreement was defined according to Portney and Watkins [16]: an ICC of 0.00–0.49 was considered poor, 0.50–0.74 moderate, and 0.75–1.00 excellent. As suggested by Bland and Altman [17, 18], the 95% limits of agreement were used to evaluate the agreement between measurements acquired by different raters using different software with different thresholds. The standard error of measurement (SEM) estimates the expected error associated with a specific measurement, \(\text{SEM} = S\sqrt{1 - r_{xx}}\), where \(S\) is the standard deviation of the test and \(r_{xx}\) is the reliability of the test. The results were analyzed separately for each muscle and spinal level investigated. The Wilcoxon rank sum test was used to analyze systematic differences between the two threshold methods. The statistical analysis was conducted using the Statistical Package for the Social Sciences version 23.0 (SPSS Inc, Chicago, Illinois).
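
The agreement statistics named above are simple to compute once paired measurements are available. The following sketch is illustrative only (the analysis in this study was run in SPSS): the function names are hypothetical, the limits of agreement are taken as the mean difference ± 1.96 SD of the differences, and the ICC itself is assumed to have been computed elsewhere.

```python
import numpy as np

def bland_altman_limits(a, b):
    """Mean difference and 95% limits of agreement (mean ± 1.96 SD) for paired data."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    mean_diff = diff.mean()
    sd_diff = diff.std(ddof=1)  # sample SD of the paired differences
    return mean_diff, mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff

def standard_error_of_measurement(sd, reliability):
    """SEM = S * sqrt(1 - r_xx), where r_xx is a reliability coefficient such as the ICC."""
    return sd * np.sqrt(1.0 - reliability)

def classify_icc(icc):
    """Agreement categories as used in this paper (after Portney and Watkins)."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    return "excellent"
```

A mean difference that lies above zero with `a` taken from one program and `b` from the other indicates that the first program yields systematically higher values.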

The required sample size was determined with G*Power version 3.1.3 (University of Düsseldorf, Düsseldorf, Germany), following Cohen’s effect size conventions of “small,” “medium,” and “large” [19]. With an effect size of 0.3, an alpha error of 0.05, and a power of 0.8, the minimum sample size was 46 participants. Therefore, the inclusion of 60 patients was considered adequate to achieve the desired statistical power.

Results

Inter-software reliability of muscle measurements using ImageJ and Amira

The inter-software reliability (ICC), SEM values, and descriptive statistics (mean ± SD) are presented in Table 1. All ICCs for CSA, FCSA, and FCSA/CSA, regardless of threshold method, muscle analyzed, or spinal level, showed excellent agreement and varied between 0.75 and 0.99. The SEM values were also comparable across software, muscle measurements, muscles analyzed, and spinal segments. In Figs. 2 and 3, Bland–Altman plots illustrate the agreement between Amira and ImageJ. The solid line, which marks the mean difference, consistently lies above zero, indicating a systematic tendency toward higher FCSA values with Amira.

Table 1 Inter-software reliability indexes between Amira and ImageJ for the right MF and ES muscles
Fig. 2

Bland–Altman 95% limits of agreement plots for the FCSA measurements of the right MF and ES at L4–L5 and L5–S1. The solid line represents the mean difference between the two measurement methods (i.e., Amira value − ImageJ value); the dotted lines represent the 95% limits of agreement for the difference (defined as the mean ± 1.96 SD)

Fig. 3

Bland–Altman 95% limits of agreement plots for the FCSA% (100*FCSA/CSA) measurements of the right MF and ES at L4–L5 and L5–S1. The solid line represents the mean difference between the two measurement methods (i.e., Amira value − ImageJ value); the dotted lines represent the 95% limits of agreement for the difference (defined as the mean ± 1.96 SD)

Inter-rater reliability of muscle measurements using ImageJ and Amira with Overlap method and Circle method

The inter-rater reliability (ICC), SEM values, and descriptive statistics (mean ± SD) for the MF and ES at L4–L5 and L5–S1 are given for both software programs in Table 2. When measured using Amira, the ICC for inter-rater reliability at both spinal levels ranged from 0.75 to 0.99 for the Overlap method and from 0.89 to 0.98 for the Circle method. When measured using ImageJ, the ICC at both spinal levels ranged from 0.75 to 0.99 for the Overlap method and from 0.88 to 0.98 for the Circle method. There were no significant differences in the ICC ranges for inter-rater reliability between the two software programs or the two threshold methods. However, compared with the Overlap method, the ICC of the Circle method was higher, and the SEM value was also slightly higher.

Table 2 Inter-rater reliability for different software and threshold segmentation

Inter-threshold reliability of muscle measurements using the Circle method and Overlap method

The ICC of MF and ES composition between the two threshold methods showed poor or moderate agreement in both software programs (Table 3). Accordingly, the SEM values for the ES and MF were high in each software program. All FCSA and FCSA/CSA values measured with the Overlap method were notably higher than those measured with the Circle method, and this difference was statistically significant (P < 0.01).

Table 3 Inter-threshold reliability indexes between Circle and Overlap method for the right MF and ES muscles

Discussion

In this quantitative assessment of lumbar paraspinal muscle composition, we compared the segmentation of the paraspinal muscles (MF and ES) obtained with different thresholding methods (Circle and Overlap) and different software (ImageJ and Amira). The paraspinal muscle measurements obtained with the two image processing programs showed excellent agreement. These findings are supported by the Bland and Altman limits of agreement, which indicate that the agreement between the two software programs is acceptable and that they can be used interchangeably. In addition, similar inter-rater and inter-software reliability coefficients and SEM values indicated that the software used contributed little to measurement error. In line with prior research, we found excellent inter-rater reliability for CSA and FCSA measurements [20, 21]. However, relevant differences between the two threshold methods were observed: the agreement of the paraspinal muscle measurements between these two methods was poor or moderate.

Both threshold methods have been used in prior work and in comparisons, although agreement and reliability between the two threshold methods could not be confirmed by this study [14, 15]. The differences might result from the two different ways of determining the upper threshold. Research on the reliability and agreement of inter-threshold comparisons is very rare. Fortin et al. [22] compared an automated thresholding algorithm with the Circle method and reported excellent agreement between 0.79 and 0.99. Besides that study, to the knowledge of the authors, there is a paucity of inter-threshold comparisons.

As thresholding is crucial for muscle segmentation, several studies have proposed different manual, semi-automated, or automated approaches. Otsu [23] presented a method to select a threshold automatically from a gray-level histogram. Cooley et al. [24] acquired an initial histogram for each image by first outlining both MF (connected via the subcutaneous fat but excluding any vertebral structures). Ranson et al. [25] proposed tissue differentiation based on manual segmentation of three tissue types within the MRI: vertebral bone, paraspinal muscles, and intermuscular fat. The resulting grayscale values for the three tissue types were then normalized to the total number of pixels analyzed to determine the grayscale range of MR signal intensities for the three tissue types across the scan set. Although these articles suggest methods that may be effective, they do not provide a gold standard for assessing fatty infiltration of the paraspinal muscles.
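
To illustrate what automatic threshold selection from a gray-level histogram means in practice, a compact sketch of Otsu's criterion (choosing the threshold that maximizes the between-class variance) is given below. It was not part of the present study; the function name `otsu_threshold` and the bin count are illustrative choices, and the input is assumed to be the intensities of all pixels inside the muscle outline.

```python
import numpy as np

def otsu_threshold(values, n_bins=256):
    """Threshold that maximizes the between-class variance of a gray-level histogram."""
    hist, edges = np.histogram(values, bins=n_bins)
    centers = (edges[:-1] + edges[1:]) / 2

    # Pixel counts in the lower class (bins 0..i) and the upper class (bins i..end).
    weight_low = np.cumsum(hist)
    weight_high = np.cumsum(hist[::-1])[::-1]

    # Mean intensity of each class; np.maximum guards against empty classes.
    mean_low = np.cumsum(hist * centers) / np.maximum(weight_low, 1)
    mean_high = (np.cumsum((hist * centers)[::-1])
                 / np.maximum(weight_high[::-1], 1))[::-1]

    # Between-class variance for a split between bin i and bin i + 1.
    variance = weight_low[:-1] * weight_high[1:] * (mean_low[:-1] - mean_high[1:]) ** 2
    return centers[np.argmax(variance)]
```

Unlike the Circle and Overlap methods, this approach needs no manually placed reference regions, but like them it is not a gold standard for fat quantification.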

The main limitation of this paper is that only two image processing software packages for the quantitative assessment of paraspinal muscle composition were compared, although a wide variety of software is available. Besides the two manual threshold methods compared here, there are further manual, semi-automated, and automated approaches that could not be compared within this paper.

Conclusion

In conclusion, the presented method for studying paraspinal muscle CSA and composition has a high degree of reliability, with very good agreement between the two software programs. However, the comparison between the two thresholding approaches showed mostly moderate or poor reliability; therefore, results obtained with these different thresholding methods should not be compared against each other.