Introduction

Osteoporosis is a worldwide health problem contributing considerably to health care costs, morbidity and mortality [1]. Due to the increasing age of the population, this problem will become even more substantial over time. Osteoporosis is a good candidate for screening, because it has a long preclinical phase and cost-effective therapeutic possibilities are readily available [2]. Therefore, early detection may be a reasonable strategy to prevent osteoporosis-related fractures [3].

Bone mineral density (BMD) can be derived from quantitative computed tomography (QCT), ultrasound, and dual energy x-ray absorptiometry (DXA), but the World Health Organization (WHO) definition for osteoporosis only includes the DXA T-score [4]. DXA is a strong predictor of bone deterioration and fracture risk [5]. Nevertheless, osteoporosis remains underdiagnosed in the general population [6]. Recent studies proposed the use of regular clinical computed tomography (CT) scans for bone mineral density assessment as an opportunistic screening method [7–14]. In this way, BMD surrogates could be derived from CT without additional radiation dose. Deterioration of BMD could be determined in the early stages of disease, which could prompt suitable medical treatment of osteoporosis, and thus prevent fractures in patients at risk. Additionally, scans performed in screening programs such as lung cancer screening or coronary artery disease evaluation may also be used, since BMD measurements on CT have been shown to predict all-cause mortality in lung cancer screening participants [15].

Bone density measurements on CT are mostly performed by manually placing a region of interest (ROI) in a lumbar or thoracic vertebra. Although ROI measurements may in the future be performed automatically using software, the placement of a ROI is currently performed manually, and thus may vary in size, shape and location [16]. This could introduce variability in measurements and may depend on the experience of the observer. Furthermore, the inter-examination reliability is currently unknown. High inter-examination reliability would enable monitoring changes over time; for example, after an intervention. A few studies addressed the reproducibility of bone density assessment of the vertebrae [9, 17], but data are lacking for unenhanced low-dose thoracic CT. Therefore, the aim of this study was to assess the inter-observer and inter-examination reliability and agreement in attenuation measurements of the vertebrae on low-dose unenhanced CT.

Materials and methods

Subjects

The study population was derived from the Dutch and Belgian Lung Cancer Screening Trial (NELSON). For this study, subjects from the University Medical Center Utrecht were included. A detailed description of study rationale and design is published elsewhere [18, 19]. In short, current and former smokers aged 50–75 years were included. Included participants had a smoking history of >15 cigarettes/day for >25 years or >10 cigarettes/day for >30 years. The NELSON trial was approved by the Dutch Ministry of Health and the local institutional ethical review board. Written informed consent was obtained from all participants. Participants who underwent a short-term follow-up CT after 3 months because of an intermediate-risk lung nodule were included. Participants were excluded if the scan protocol, in particular the tube voltage (kVp), differed between baseline and follow-up, or when the interval between the CTs was longer than 100 days.

Image acquisition

All participants underwent a low-dose inspiratory volumetric CT scan using a 16-slice scanner (Brilliance 16P or MX8000 IDT; Philips Healthcare, Best, The Netherlands). The same scanning protocols were used at baseline and follow-up in the subjects included in this study. CT data were obtained using 16 × 0.75 mm collimation (pitch = 1.3) [20]. No intravenous contrast injection was applied. Participants weighing less than 80 kg were scanned with 120 kVp at 30 mAs. Participants weighing 80 kg or more were scanned with 140 kVp at 30 mAs. Slice thickness was 1.0 mm and axial images were reconstructed at 0.7 mm increment, using a smooth reconstruction filter (B-filter; Philips Healthcare, Best, The Netherlands). The scanners were calibrated according to the manufacturer's recommendations and screening scans were obtained within 24 hours after calibration.

Assessment of BMD – ROI placement

Before measurements were made, all observers attended an interactive training session to become skilful in the measurement technique. First, observers were instructed to evaluate the first lumbar vertebral body (L1) for the presence of fractures and focal heterogeneities. Secondly, the upper part of L1 was identified by defining the area between the endplate and the entrance of vessels at the midportion. Thirdly, a rectangular ROI was drawn as large as possible and positioned in a homogeneous area of trabecular bone, excluding the cortex and inhomogeneous areas. All measurements were performed in axial view. The mean CT attenuation was measured in Hounsfield Units (HU). An example is given in Fig. 1. If L1 was fractured, inhomogeneous or not visualized, the first intact vertebra from L1 upwards was measured. For instance, if L1 was fractured, the twelfth thoracic vertebra (T12) was used.

Fig. 1 Example of placement of a region of interest (ROI) in the first lumbar vertebra (L1) of a participant. Axial image of L1 in bone window (C: 300; W: 1600). A ROI is placed in the upper part of L1 between the endplate and the entrance of vessels at the midportion. The ROI was as large as possible and positioned in a homogeneous area without inclusion of the cortex

Image analysis

Six observers with various levels of experience in CT reading participated in the current study: one board certified chest radiologist with 10 years of experience (P.J.), one board certified radiologist with 8 years of experience (R.B.), three research physicians with 2 years of experience (R.T., M.W. and E.P.) in CT reading and one medical student (W.J.).

To obtain inter-observer agreement, all six observers performed ROI measurements in 100 randomly selected CT scans. Measurements were conducted independently by all observers without knowledge of the outcomes of the other observers. For the inter-examination reliability and agreement, one observer (E.P.), who showed good agreement with the experienced observers, examined all baseline and follow-up CTs. Measurements at follow-up scans were made in the same session as the baseline scans. All scans were measured in a random order and the reader was blinded to acquisition date. Measurements were made at the same vertebral level for baseline and follow-up. If measurement in L1 was not feasible because of inhomogeneous areas, or because L1 was not included on the CT, the first intact vertebra from L1 upwards was measured both at baseline and at follow-up. Observers performed measurements in bone window (C: 300; W: 1600) and were blinded to subject characteristics.

Statistical analysis

Inter-examination reliability was assessed using the single measures intraclass correlation coefficient (ICC). ICCs were compared using Student's t-test. The limits of inter-examination agreement were defined as the mean difference ± 1.96 × the standard deviation (SD) and were plotted using the Bland-Altman method [21].
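The two inter-examination statistics named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SPSS procedure used in the study. The function names are hypothetical, and the ICC variant shown (two-way random effects, absolute agreement, single measures, often written ICC(2,1)) is an assumption, because the ICC model is not specified in the text.

```python
from statistics import mean, stdev


def icc_single_measures(scores):
    """Single-measures ICC from a subjects x measurements table.

    Assumption: two-way random effects, absolute agreement (ICC(2,1));
    the study text does not state which ICC model was used.
    """
    n = len(scores)          # number of subjects (rows)
    k = len(scores[0])       # number of measurements per subject (columns)
    grand = mean(v for row in scores for v in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(col) for col in zip(*scores)]

    # Two-way ANOVA sums of squares
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def bland_altman_limits(baseline, follow_up, z=1.96):
    """Mean difference and limits of agreement (mean difference +/- z * SD).

    Assumption: differences are taken as baseline minus follow-up.
    """
    diffs = [b - f for b, f in zip(baseline, follow_up)]
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)   # sample SD (n - 1 in the denominator)
    return mean_diff, mean_diff - z * sd_diff, mean_diff + z * sd_diff
```

For paired HU values, `icc_single_measures` would take one row per participant with the baseline and follow-up value as columns, while `bland_altman_limits` takes the two series separately.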

Inter-observer reliability was estimated for all measurements made in 100 participants, using the single measures ICC. Inter-observer agreement was calculated using the mean difference ± SD and was assessed by a graphical method proposed by Jones et al. [22]. This method is based on the Bland-Altman graphical method for the assessment of agreement between two observers, and is modified to allow for agreement between multiple observers. The limits of agreement with the mean represent how much a measurement of an individual observer can differ from the mean measurement of all observers.
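The idea behind limits of agreement with the mean can be illustrated with a simplified sketch. Each observer's value is compared with the mean of all observers for the same participant, and the spread of those deviations gives symmetric limits. The function name is hypothetical, and the variance estimator used here (pooled over participants with n × (m − 1) degrees of freedom) is an assumption; the published method [22] may estimate the SD differently.

```python
from math import sqrt
from statistics import mean


def limits_of_agreement_with_mean(measurements, z=1.96):
    """Simplified limits of agreement with the mean for several observers.

    measurements: one row per participant, one column per observer (HU).
    Returns (lower_limit, upper_limit, deviations), where deviations holds
    each observer's difference from the participant's mean over observers.
    Assumption: the SD is pooled with n * (m - 1) degrees of freedom.
    """
    n = len(measurements)        # participants
    m = len(measurements[0])     # observers

    deviations = []
    for row in measurements:
        participant_mean = mean(row)
        deviations.append([value - participant_mean for value in row])

    # Deviations sum to zero within each participant, so the limits
    # are centred on zero.
    sum_of_squares = sum(d * d for row in deviations for d in row)
    sd = sqrt(sum_of_squares / (n * (m - 1)))
    return -z * sd, z * sd, deviations
```

Plotting the returned deviations against each participant's mean HU value, with one symbol per observer, gives the kind of agreement plot described above.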

In addition, the effect of inter-examination variability or inter-observer variability on reclassification of patients as osteoporotic versus non-osteoporotic based on previously suggested cutoffs was calculated. To define osteoporosis a threshold of 110 HU was used, as derived from Pickhardt et al. [7]. All analyses were performed using SPSS Version 20.0 (SPSS, Chicago, Illinois, USA). P values below 0.05 were considered statistically significant. Results are reported according to the GRRAS guidelines [23].
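The reclassification analysis amounts to applying the same HU cutoff to two sets of measurements and counting discordant classifications. A minimal sketch follows. The function names are hypothetical, and treating values at or below the threshold as osteoporotic is an assumption, since the handling of values exactly at 110 HU is not stated here.

```python
from collections import Counter

THRESHOLD_HU = 110  # cutoff used in the text, derived from Pickhardt et al. [7]


def is_osteoporotic(hu, threshold=THRESHOLD_HU):
    # Assumption: attenuation at or below the threshold counts as osteoporotic.
    return hu <= threshold


def reclassification_counts(baseline, follow_up, threshold=THRESHOLD_HU):
    """Cross-tabulate classification at baseline against follow-up.

    Keys are (osteoporotic_at_baseline, osteoporotic_at_follow_up).
    """
    counts = Counter()
    for b, f in zip(baseline, follow_up):
        counts[(is_osteoporotic(b, threshold), is_osteoporotic(f, threshold))] += 1
    return counts


def discordant_fraction(counts):
    """Fraction of participants whose classification changed."""
    total = sum(counts.values())
    changed = counts[(True, False)] + counts[(False, True)]
    return changed / total


def unanimous_fraction(measurements, threshold=THRESHOLD_HU):
    """Fraction of participants for whom all observers give the same class.

    measurements: one row per participant, one column per observer (HU).
    """
    unanimous = 0
    for row in measurements:
        classes = {is_osteoporotic(value, threshold) for value in row}
        if len(classes) == 1:
            unanimous += 1
    return unanimous / len(measurements)
```

`reclassification_counts` corresponds to the inter-examination comparison and `unanimous_fraction` to the inter-observer comparison.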

Results

Baseline characteristics

In total, 539 participants were rescanned after 3 months because of an indeterminate lung nodule. One hundred scans were randomly selected for inter-observer analysis. For inter-examination analysis, 97 participants were excluded because of the use of a different kVp at baseline and follow-up. Another 75 participants were excluded because of a follow-up time of more than 100 days; this resulted in 367 eligible participants for inter-examination analysis. A flowchart of inclusion and exclusion of participants is shown in Fig. 2. Mean ± SD age was 60.6 ± 5.9 years. Median time between the baseline and follow-up CT was 91 (P25 – P75: 91 – 91) days. Additional baseline characteristics are presented in Table 1.

Fig. 2 Flowchart of inclusion and exclusion of participants

Table 1 Baseline characteristics (n = 367)

Inter-examination variability

In total, 287 (78.2 %) measurements were made at level L1, 69 (18.8 %) at level T12, 10 (2.7 %) at level T11 and 1 (0.3 %) at level T10. Mean ± SD bone density was 108 ± 35 HU at baseline and 107 ± 35 HU at follow-up. The inter-examination reliability for ROI measurements of 367 participants was excellent with an ICC of 0.92 (0.90–0.94, p < 0.01) and did not differ significantly between men and women (0.92 (0.91–0.94) vs. 0.94 (0.90–0.96)). Inter-examination ICCs did not differ significantly between vertebral levels with an ICC for L1 of 0.92 (0.90–0.94), an ICC for T12 of 0.91 (0.86–0.95) and an ICC for T11 of 0.92 (0.72–0.98). Mean ± SD difference between baseline and follow-up was 1 ± 14 HU. The inter-examination agreement is plotted in Fig. 3. Limits of agreement were -26 and 28 HU.
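As a check, the reported limits can be approximately reproduced from the rounded summary values, using the definition given in the statistical analysis (mean difference ± 1.96 × SD):

```latex
\[
\bar{d} \pm 1.96\, s_d \;=\; 1 \pm 1.96 \times 14 \;=\; 1 \pm 27.4
\quad\Rightarrow\quad [-26.4,\; 28.4]\ \text{HU}
\]
```

This agrees with the reported limits of -26 and 28 HU, up to rounding of the mean difference and the SD.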

Fig. 3 Bland-Altman plot displaying the inter-examination differences in vertebral Hounsfield unit (HU) measurements. Agreement is shown for 367 participants. The mean of differences (solid horizontal line) was 1 HU. The upper dashed line shows the upper 95 % limit of agreement (28 HU), and the bottom dashed line shows the lower 95 % limit of agreement (-26 HU)

Inter-observer variability

Eleven (1.8 %) measurements were missing due to software problems. For this reason, multiple imputation was used to complete the data set (20 imputations). Of all measurements, the mean ± SD HU-value was 100 ± 28 HU. Mean HU-values for all observers ranged from 95 to 105 HU, with SD ranging from 28 to 33 HU. Overall inter-observer reliability for measuring HU-attenuation of the vertebrae was excellent, with an ICC of 0.82 (p < 0.001). All ICCs compared between two observers are shown in Table 2 and varied from moderate (0.70) to excellent (0.91). The mean difference ± SD between all observers was 1 ± 6 HU. The inter-observer agreement of all 100 examinations is plotted in Fig. 4. Limits of agreement with the mean ranged from -12 to 12 HU.

Table 2 Inter-observer agreement for ROI measurements

Fig. 4 Inter-observer agreement plot between six observers for vertebral attenuation measurements. Observers are represented by different symbols. The differences of all measurements with the mean (y-axis) are plotted against the mean Hounsfield unit (HU) values for all participants (x-axis). The horizontal dashed lines indicate the limits of agreement with the mean of the six observers, which ranged from -12 to 12 HU. This indicates that an individual observer's measurement can deviate from the mean of all observers by as much as 12 HU

Reclassification analysis

For reclassification analyses, only measurements at L1 were used (n = 287). Based on a threshold of 110 HU, 159 (55.4 %) participants were classified as having osteoporosis at baseline and 163 (56.8 %) participants at follow-up. Fourteen (4.9 %) participants were classified as having osteoporosis at baseline, but not at follow-up, and 18 (6.3 %) participants were classified as having osteoporosis at follow-up, but not at baseline. Table 3 shows reclassification results due to inter-examination variation.

Table 3 Reclassification analysis according to inter-examination variability

When using the same threshold of 110 HU to define osteoporosis in 77 participants with inter-observer measurements at L1, all six observers agreed about the diagnosis (osteoporosis yes/no) in 60 (77.9 %) participants and at least one observer did not agree in 17 (22.1 %) participants.

Discussion

In this study, we found excellent inter-examination reliability for manual bone density measurements of the vertebrae. Limits of agreement ranged from -26 to 28 HU, which means a change of at least 28 HU is needed in order to detect a real change in bone attenuation. Therefore, these results have to be taken into account when planning to use bone density measurements for longitudinal studies (e.g., for measuring therapeutic effects). Inter-observer reliability was good to excellent and limits of agreement with the mean ranged from -12 to 12 HU, which indicates that observers can be discordant with the mean estimated bone attenuation by 12 HU.

Our results imply that manual placement of a ROI in L1 is a reliable method for the quantification of vertebral attenuation. Therefore, in a lung cancer screening setting, low-dose chest CTs may be used to measure bone attenuation. Because these measurements are performed manually, in theory, experience could influence the precision of the measurement. However, the present study shows that radiological experience has no major effect on attenuation measurements. Moreover, ICCs between more experienced observers were not better than between less experienced observers. Low-dose CT scans could therefore gain a role in early detection of osteoporosis.

With the recent recommendation on the implementation of lung cancer screening [24], a large number of subjects will receive a low-dose chest CT. Next to screening for lung cancer, this can provide an opportunity for the assessment of other abnormalities, such as chronic obstructive pulmonary disease and coronary artery calcifications [25]. Because smoking is associated with lower bone density [26], this could be an opportunity for the detection of osteoporosis in this smoking population. By diagnosing low bone density as well, this could improve the yield and cost-effectiveness of lung cancer screening.

To our knowledge, this is the first study to describe the inter-examination agreement and reliability of attenuation measurements of the vertebrae on unenhanced low-dose CT in a large population. In addition, we extensively studied inter-observer agreement and reliability. Although several studies used attenuation measurements in the search for an appropriate screening tool for osteoporosis, studies on the agreement and reliability are lacking.

Ohara et al. [9] studied the correlation between pulmonary emphysema and reduced bone density. For this purpose, they used manual vertebral bone measurements. They validated their measurements by calculating correlation coefficients of two observers. This resulted in ICCs of 0.995, 0.993, 0.950 and 0.996 for T4, T7, T10 and L1, respectively. Their strength was the evaluation of multiple vertebral levels, but they concluded that the average bone density of three thoracic vertebral bones was highly correlated with bone density in L1 alone (r = 0.914, p < 0.001). Pickhardt et al. [7] elaborated on this and found that measurements at L1 are as or more accurate than the results at other levels, including multilevel assessment. Also, Romme et al. [27] showed no added value of using three thoracic vertebral levels to assess bone density compared to one measurement at L1. Although L1 seems to provide the most accurate results in terms of attenuation measurements, this vertebral level is not always included on thoracic CT. In our population of 376 participants, 89 (23.7 %) measurements were made at a vertebral level different from L1.

Both Pickhardt et al. [17] and Romme et al. [27] studied inter-observer agreement and found limits of agreement between two observers of -6 HU to 16 HU for T12-L5 and -9 to 5 HU for T4-T7-T10, respectively. We extended these findings by using six observers with different experience levels and showed limits of agreement with the mean ranging from -12 to 12 HU. The intra-observer limits of agreement from Romme et al. ranged from -9 to 5 HU in 20 participants. Our inter-examination limits of agreement were substantially wider, ranging from -26 to 28 HU, but could be more realistic as a result of a larger study cohort.

Besides presenting positive results in terms of agreement and reliability, it is important to estimate the impact of these results on clinical practice. In order to perform reclassification analyses, we used a threshold of 110 HU to define osteoporosis, which was derived from Pickhardt et al. [7]. This threshold was proposed for a routine care population with lower osteoporosis risk because of its high specificity. Buckens et al. [28] validated this threshold as optimal when compared to DXA. By using this threshold, 159 (55.4 %) participants were classified as having osteoporosis. This high prevalence is in line with some findings of osteoporosis prevalence in a high-risk chronic obstructive pulmonary disease (COPD) cohort [29]. With this heavy smoking population being at risk for osteoporosis as well, these prevalence numbers could be appropriate. Another explanation for the high prevalence could be that the HU in the vertebra was systematically lower compared to the studies by Pickhardt et al. and Buckens et al.

Our reclassification analysis showed that inter-examination variability can lead to a different diagnosis in 11.2 % of included participants. Moreover, inter-observer variability can lead to a discordant classification in up to 22.1 % of participants. As a consequence, when measuring bone density that is close to a threshold that defines disease, the effect of variability within a patient and between different observers could be substantial. Variability therefore has to be taken into account when developing guidelines for osteoporosis screening.
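Both percentages can be traced back to the counts reported in the reclassification analysis (14 and 18 discordant participants out of 287 measured at L1, and 17 out of 77 participants with inter-observer measurements at L1). A short check, assuming the 11.2 % was obtained by adding the two rounded percentages:

```latex
\[
\frac{14}{287} \approx 4.9\,\%, \qquad
\frac{18}{287} \approx 6.3\,\%, \qquad
4.9\,\% + 6.3\,\% = 11.2\,\%, \qquad
\frac{14 + 18}{287} = \frac{32}{287} \approx 11.1\,\%
\]
\[
\frac{17}{77} \approx 22.1\,\%
\]
```

Computed directly from the counts, the inter-examination proportion is about 11.1 %; the two values differ only by rounding.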

Our study has limitations. First, the follow-up scans for the assessment of inter-examination variability were performed three months after baseline. In this period, CT attenuation values could have changed. However, we think the impact will be limited because decline in bone density progresses slowly. Still, a follow-up CT examination directly after baseline would be preferable to eliminate changes over time. Second, we only used measurements of one vertebra in our evaluation and did not include more vertebral levels. Nevertheless, one may assume that, even if bone attenuation varies between vertebral levels, inter-observer and inter-examination agreement would be similar [9]. Moreover, previous studies have shown that ROI placement in multiple vertebrae does not add value compared to one measurement at L1 [7, 27]. Third, as a consequence of the study design of the lung cancer screening trial, only a small proportion of our cohort consisted of women. However, in this cohort, no difference in inter-examination variability in HU was seen between men and women. Lastly, although our scanners were calibrated weekly, we did not use a calibration phantom in this study as is done in QCT of the spine. We were therefore unable to provide BMD as milligrams hydroxyapatite per cubic centimetre and our method has lower precision compared to QCT [30, 31]. However, previous studies have shown that, although precision was lower compared to QCT, BMD estimation techniques without phantom calibration were nevertheless promising for assessing fracture risk [11].

In conclusion, this study shows that bone attenuation can be measured by manual ROI placement on unenhanced low-dose chest CT examinations with good reliability. However, when developing guidelines for early detection of osteoporosis, variability still has to be taken into account. While the distinctive character of this technique is excellent, diagnostic studies are needed to confirm these results, to evaluate its accuracy and ultimately its cost-effectiveness.