Patients and nodule selection
The study was approved by our institutional review board, and written informed consent was obtained from all patients after explanation of the risks, including the additional radiation dose. Twenty consecutive adult patients (15 men, 5 women; age range 40–84 years, mean 57 years) with known pulmonary metastases were enrolled. All patients visited the oncology outpatient department and were referred for chest CT to monitor the effect of anticancer therapy; the presence of lung metastases had previously been shown on chest CT or chest radiography. Primary tumors were melanoma (n = 3), renal cell carcinoma (n = 6), colorectal cancer (n = 5), breast carcinoma (n = 2), prostate cancer (n = 1), seminoma (n = 1), medullary thyroid cancer (n = 1) and esophageal cancer (n = 1).
All solid lesions with a minimum volume of 15 mm3 (corresponding to a diameter of about 3 mm) were included. Lung masses, defined as nodules exceeding 30 mm in diameter, were excluded from analysis. Nodules suspected of being metastases were included, as well as nodules that could potentially have a benign histology. Completely calcified nodules, however, were excluded. Only solid nodules were included since non-solid or partly solid nodules require a different segmentation approach, and not all of the evaluated segmentation software packages were developed for this task. A maximum of 50 nodules per patient were included.
To obtain an independent indication of nodule size, the maximum diameter of each nodule was measured using an electronic ruler. Nodules were categorized by size according to the Fleischner criteria [6]. Nodule shape was categorized on 3D images as spherical, lobulated or irregular: a nodule was defined as spherical when it had an approximately constant radius, as lobulated when it had a variable radius but smooth outer margins, and as irregular when the outer margins were not smooth. Attachment to the pleura or pulmonary vessels was noted.
Image acquisition
Two low-dose non-contrast-enhanced chest CTs were performed, followed by a contrast-enhanced standard-dose chest CT for clinical purposes. Between the two low-dose CT examinations, patients were asked to get off and on the table to simulate the conditions of a repeat CT examination for follow-up of a pulmonary nodule. Using this setup, growth or regression of the lung nodules could reliably be excluded.
All CT data were acquired on a 16-detector-row CT system (Mx8000 IDT, Philips Medical Systems, Cleveland, OH) using a spiral mode with 16 × 0.75-mm collimation. The entire chest was examined. CT data were acquired in full inspiration. Exposure settings for the low-dose examinations were 30 mAs and 120 kVp or 140 kVp, depending on the patient’s weight. The corresponding volume CT dose indices were 2.2 mGy and 3.5 mGy, respectively. Axial images were reconstructed at 1.0-mm thickness and 0.7-mm increment, using a moderately soft reconstruction kernel, the smallest field of view that included the outer rib margins and a 512 × 512 matrix.
Lung volume in both examinations was measured using the lung segmentation algorithm incorporated in the GE software.
Semi-automated volume measurements of pulmonary nodules
All nodule measurements were done by a single observer (2 years of experience in radiology with special interest in CT lung cancer evaluation). Nodules were identified using axial thin-slab maximum intensity projections (slab thickness 10 mm) displayed with window/center settings of 1,500/-500 HU. The same nodule was identified on the follow-up CT images using a printed screenshot.
The following segmentation algorithms were evaluated: Advantage ALA (GE, v7.4.63), Extended Brilliance Workspace (Philips, EBW v3.0), Lungcare I (Siemens, Somaris 5 VB 10A-W), Lungcare II (Siemens, Somaris 5 VE31H), OncoTreat (MEVIS, v1.6), and Vitrea (Vital images, v3.8.1, lung nodule evaluation add-on included). For the purpose of anonymization, the characters A to F were randomly assigned to the various packages.
In all packages, segmentation was initiated by clicking in the center of a nodule, which started a fully automated evaluation. Each package then segmented the nodule, calculated its volume and presented the result. The segmented area was displayed either as a thin line surrounding the nodule or as a colored overlay, and this segmentation was visually judged for accuracy. To minimize observer influence, only these automated results were used for comparisons in this study, except where explicitly stated otherwise.
We performed a separate analysis of results obtained after manual correction of incomplete segmentations. Four of the six packages allowed the user to manually correct the segmentation. In case of a mismatch between nodule and segmentation, this feature was used to obtain the most precise segmentation feasible. The type of manual correction varied between the packages (Table 1). Two packages also allowed a complete manual segmentation in case of failure; this feature was not used. The corrected segmentation was again visually judged for accuracy.
Table 1 Possibilities for manual correction used to correct the semi-automated nodule segmentation for the various software packages tested in this study
Evaluation of segmentation accuracy
To evaluate segmentation accuracy, all packages offered a volume-rendered display that could be rotated and a thin-section image in at least one plane that could be scrolled through; two packages also allowed evaluation of the segmentation in additional planes. The observer visually classified segmentation accuracy into four categories: (1) ‘excellent’: the overlay completely matched the nodule; (2) ‘satisfactory’: although not perfect, the segmented volume was still representative of the nodule, with the mismatch between overlay and nodule visually estimated not to exceed 20% in volume; (3) ‘poor’: part of the nodule was segmented, but the segmented volume was not representative of the nodule (estimated mismatch >20%); (4) ‘failure’: no segmentation, or a result with no similarity to the lesion. An example of each classification is shown in Fig. 1. To exclude the influence of failed segmentations on the reproducibility of a software package, nodules were grouped into ‘adequately’ (categories 1 and 2) and ‘inadequately’ (categories 3 and 4) segmented nodules. Inadequately segmented nodules were excluded from the calculation of interexamination variability, as these segmentations have no value and would greatly influence volume measurement reproducibility, making meaningful comparisons impossible.
Reproducibility of visual assessment of segmentation
The intra- and interobserver reproducibility for the visual assessment of segmentation accuracy was tested. On one system, the observer performed the visual assessment twice with 1 week in between readings. A second observer, a CT technician with special training in evaluating and reporting cancer screening CTs with the use of volumetric software (>4,000 examinations in 3 years), repeated the visual assessment of the segmentation accuracy as well, on the same and on a second system. We identified the percentage of nodules in which the visual assessment of segmentation changed between adequate and inadequate.
Statistical evaluation
All statistics were calculated using Microsoft Excel XP (Microsoft, Redmond, Wash.) and the SPSS statistical software package version 15 (SPSS, Chicago, Ill.).
To assess the effect of inspiration level, we calculated the Pearson correlation coefficient between the relative volume difference of the nodules and the ratio of lung volumes (first/second examination).
To compare the number of adequately segmented nodules per system, a binomial test was applied, using the percentage of adequately segmented nodules of the best system as the test proportion.
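Such a one-sided binomial test can be sketched in plain Python without a statistics library; the counts below are hypothetical and not taken from the study:

```python
from math import comb

def binom_test_one_sided(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p): the probability of observing at most
    k adequately segmented nodules out of n if the true rate were p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Hypothetical example: the best system adequately segmented 90 of 100 nodules
# (test proportion p = 0.90); another system managed only 80 of 100.
p_value = binom_test_one_sided(80, 100, 0.90)
print(round(p_value, 4))
```

A small p-value here would indicate that the second system's adequate-segmentation rate is significantly below that of the best system.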
Differences in volume (ΔV) were calculated by subtracting the volume measured on the first scan (V1) from the volume measured on the second scan (V2). This difference was then normalized with respect to mean nodule volume to assess relative differences:
$$\Delta V_{\text{rel}} = 100\% \cdot \frac{V_2 - V_1}{\left( V_1 + V_2 \right)/2}.$$
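In code, this normalization is a one-line computation; the volumes below are illustrative, not study data:

```python
def relative_difference(v1: float, v2: float) -> float:
    """Relative volume difference in %, normalized to the mean of the two
    measurements: 100 * (V2 - V1) / ((V1 + V2) / 2)."""
    return 100.0 * (v2 - v1) / ((v1 + v2) / 2.0)

# Example: a nodule measured as 200 mm3 on the first scan
# and 210 mm3 on the second scan.
print(relative_difference(200.0, 210.0))  # ≈ 4.88 (%)
```

Normalizing by the mean of the two measurements (rather than by the first) makes the quantity symmetric: swapping the scans only flips the sign.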
The histogram of relative differences showed a normal distribution for all packages (Kolmogorov-Smirnov test). Because the same nodule was measured twice on successive chest CTs, a mean relative difference close to 0 was expected; indeed, no package had a mean relative difference exceeding 1.1%. We therefore used only the upper 95% limit of agreement of the relative differences, assessed according to the method proposed by Bland and Altman [7], as the measure of interexamination variability. An increase in nodule volume above this upper limit of agreement can, with 95% confidence, be attributed to real growth.
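Under the Bland-Altman method, this upper limit of agreement is the mean relative difference plus 1.96 standard deviations. A minimal sketch, using hypothetical relative differences rather than the study's measurements:

```python
from statistics import mean, stdev

def upper_limit_of_agreement(rel_diffs) -> float:
    """Bland-Altman upper 95% limit of agreement:
    mean + 1.96 * sample SD of the paired relative differences (%)."""
    return mean(rel_diffs) + 1.96 * stdev(rel_diffs)

# Hypothetical relative volume differences (%) between two examinations
diffs = [1.2, -3.4, 0.5, 2.1, -1.8, 0.9, -0.7, 1.5]
print(round(upper_limit_of_agreement(diffs), 2))
```

Any measured volume increase exceeding this threshold would then be interpreted as real growth rather than measurement variability.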
To compare the various software packages with respect to interexamination variability, we used an F-test on a subgroup of nodules that were adequately segmented on both scans by all packages.
We also tested whether there was a significant difference in interexamination variability between excellently and satisfactorily segmented nodules. For each software package separately, an F-test was used to compare interexamination variability for all those nodules that were classified as excellently or satisfactorily segmented with this specific software.
An F-test was also used to test for differences in interexamination variability before and after manual correction by the user.
The influence of nodule diameter on interexamination variability was tested using one-way ANOVA.
In order to detect systematic differences in measured volumes between packages, we performed a mixed model variance analysis of nodule volumes on a subset of nodules that were adequately segmented by all programs.