Pathology-validated PET image data sets and their role in PET segmentation
- Cite this article as:
- Kirov, A.S. & Fanchon, L.M. Clin Transl Imaging (2014) 2: 253. doi:10.1007/s40336-014-0068-9
Positron emission tomography (PET)/computed tomography has recently been finding broader application for the diagnosis, treatment and therapy assessment of malignant disease. Accurate definition of the tumor border is extremely important for the success of localized tumor therapies. PET promises to provide the metabolically active tumor volume and, at present, it is used for target definition in a variety of tumors. This process is, however, subject to uncertainties of different origin. Resolving these uncertainties is challenging, since validating PET images and segmentation contours against tumor pathology is experimentally difficult. In addition to accurate lesion contouring, this challenges validation of PET tracers and investigations of tumor functional heterogeneity. In this paper, we briefly review the present studies providing PET image data sets with pathology validation. We focus on the specimen handling techniques aimed at achieving higher geometrical accuracy of the pathology-derived “ground truth”. We also summarize the main findings obtained for the PET segmentation techniques which have been tested with the help of these data sets. Finally, we provide a critical summary of the current state of the art in pathological validation of PET images and briefly discuss future possibilities in this direction.
Keywords: PET/CT, Histopathology, Segmentation
The establishment of various positron emission tomography (PET) tracers as biomarkers in oncology depends on quantification of tracer uptake in PET images. Segmentation of tumors in PET images is one approach to PET quantification and it is needed in localized cancer therapies to define the lesion borders [1, 2] and assess the effects of different treatment approaches. The stakes are especially high for hypo-fractionated radiation therapy, in which lethal doses need to be delivered to lesions that are often next to critical organs. Also, the use of PET/computed tomography (CT) guidance in interventional radiology procedures including tumor ablation  is increasing, and this requires both accurate and quick lesion border determination. Unfortunately, the accuracy of PET quantification is currently low due to the low spatial resolution of PET scanners, other PET imaging artifacts, and lack of “ground truth” for clinical images.
Recent advances in radiation therapy technology have opened the way for high-precision delivery of very high doses of radiation to a previously defined tumor target. Since the technical challenges of radiation delivery and of the immobilization or tracking of the target can be adequately addressed [4, 5], the problem of accurately defining the tumor target remains a major limitation . Although PET is at present the main imaging modality that allows the tumor to be defined based on its metabolic properties, it has poor resolution and is subject to several artifacts of biological, physical and technical origin . These factors challenge the tumor segmentation process [8, 9]. Similarly, accurate definition of the tumor border is needed in image-guided interventions . An example of this is PET/CT-guided percutaneous ablation, in which the interventional radiologist aims to conform the ablation volume to the PET-avid area in fused PET/CT images [10, 11]. Therefore, experimental verification of lesion margins derived from PET is of great importance.
Delineation of tumors in PET images can be performed manually or automatically. Automatic segmentation reduces inter-observer variability [12, 13]. As a result, many PET auto-segmentation (PET-AS) algorithms have recently been developed and these are reviewed elsewhere [9, 14, 15]. Evaluation of these algorithms can be performed on phantom-based images, simulated images, or on clinical PET images with either manual delineation or pathology validation. Experimental phantom images play a very important role in the initial evaluation of segmentation algorithms, and phantom configurations with different degrees of complexity have been used. These include different-sized spherical or cylindrical objects with uniform activity concentration, similar to the NEMA image quality phantom [16, 17, 18], as well as more complex phantoms, which represent real tumor shapes [19, 20]. Phantoms containing simple object shapes which can be filled with activity provide an easier way of producing PET images with ground truth, but the images are not very realistic in terms of tumor shape and non-uniformity of uptake, and they are subject to cold wall artifacts [14, 18]. Simulated phantom images overcome this latter limitation and make it possible to produce images of realistically shaped lesions with more realistic activity distributions , either through some of the advanced Monte Carlo tools dedicated to nuclear medicine [22, 23, 24], or through simpler analytical forward projection approaches. A detailed overview of the different experimental and simulated phantoms used for evaluating PET-AS approaches is provided in the upcoming first report of Task Group No. 211 (TG211) of the American Association of Physicists in Medicine (AAPM) .
The major issues with manual delineation as a surrogate of truth are its highly subjective nature and low reproducibility. Although histopathology-derived ground truth may contain geometrical uncertainties, e.g., due to specimen deformation or shrinkage, it provides for a more clinically adequate evaluation of segmentation, since it does not contain the physical and biological biases which may be present in the PET image. Unfortunately, the number of PET image data sets accompanied by histopathology information is very limited, due to the significant difficulties in performing these studies. The growing role of PET has prompted the publication, during the last decade, of several such investigations in which the features of the PET image are compared against histopathology findings of the excised specimen. The present review includes studies on this topic published in peer-reviewed journals and identified through PubMed searches. These investigations are reviewed here with respect to their importance for evaluating PET images and segmenting lesions from these images. Many of the reviewed studies compared gross tumor volumes (GTVs) that were also defined by other imaging modalities, namely CT and magnetic resonance imaging (MRI); these are mentioned here only briefly, where they relate to the accuracy of the pathology data or the PET images. PET image data sets obtained with the help of experimental and numerical phantoms contribute to resolving the physical sources of uncertainty and their impact on segmentation accuracy, as discussed elsewhere , but they do not address the link between histopathology and PET tracer uptake and are therefore not discussed here. Since the PET radiotracer used in all of the reviewed studies (except one, which also uses an additional tracer) is 18F-fluorodeoxyglucose (FDG), throughout the text PET means FDG PET unless stated otherwise.
PET image data sets with pathological validation
Available data sets
Summary of current publications defining PET image data sets with pathological validation (PSPV)
Number of cases per study with histopathology data
Head and neck
Daisne et al. 
Burri et al. 
18 patients, 27 tumors (18 primary tumors and 9 cervical nodal metastases)
Caldas-Magalhaes et al. 
Schinagl et al. 
12 patients with head and neck squamous cell carcinoma and 28 metastases in cervical lymph node(s)
Stroom et al. 
Van Baardwijk et al. 
Dahele et al. 
Yu et al. 
Wu et al. 
Wanet et al. 
van Loon et al. 
Meng et al. 
Schaefer et al. 
Zhong et al. 
Han et al. 
Roels et al. 
Buijsen et al. 
Colon and sigmoid
Chen et al. 
35 patients with colon cancer, 42 patients with sigmoid cancer
Zhang et al. 
Sridhar et al. 
52 patients: 22 head and neck, 22 lung, 8 colorectal
Summary of fixation and registration steps and deformation corrections applied in investigations performing detailed tumor shape and volume reconstruction from pathology sections
Daisne et al. /head and neck
Gelatin plus refrigeration sequence
Use wooden rods as fiducial markers
RPC: semi-automated rigid method based on surface segmentation
“Negligible,” according to an animal model
Stroom et al. /NSCLC
Formalin inflation of excised lobes, slicing and embedding in paraffin
RPC: Resliced CT to match orientation of macroscopic slices
5- to 10-mm macroscopic slices, 4-µm microscopic slices
Deformation of the lung volume measurement: In 3 directions from ratios of the distances from the GTV edge to fiducial markers in the CT and pathology images
van Loon et al. /NSCLC
Same as described by Stroom et al. 
Same as described by Stroom et al. 
Same as described by Stroom et al. 
Same as described by Stroom et al. 
Yu et al. /NSCLC
Fixation of bisected tumor in 10 % formalin for at least 24 h
Orienting the specimen in the “in vivo geometry” and bisecting in the transverse plane
5- to 7-µm sections at 4-mm intervals
From digital photography of the bisected specimen with a ruler before and after fixation
Meng et al. /NSCLC
Specimen fixed in 10 % formalin
5- to 7-mm-thick slices
at 4-mm intervals. Also whole-mount paraffin sections
Schaefer et al. /NSCLC
Caldas-Magalhaes et al. /Laryngeal
Specimen fixed in 10 % formaldehyde for at least 48 h, and put in a box which was then filled with agarose. For histological analysis, the agarose was removed from the slices, which were then decalcified and embedded in paraffin
Three carbon rods were inserted in the specimen craniocaudal direction
RPC: registration between the specimen reconstructed in 3D and the CT of the specimen after fixation (CTpost); registration of CTpost with CT
For each 3-mm slice, one 4-µm section was obtained
Shrinking of the specimens mostly occurred during pathology processing to obtain H&E sections from thick sections
Wanet et al. /NSCLC
After extraction, the lobes were inflated with liquid gelatin. Each specimen was placed in a box, which was then filled with gelatin and frozen
Wooden rods were placed in the box
4-mm macroscopic slices
GTV was assumed to be non-deformable
Roels et al. /rectal cancer
Each specimen was placed in a box, which was then filled with gelatin and kept at −20 °C for 2–3 days
Wooden rods were placed inside and around the specimen
2- to 3-mm-thick slices and microscopically thin cross-sections
Corrections for volume shrinkage after fixation by rigid registration of the microscopic to the macroscopic slices
Zhang et al. /cervical cancer
Specimens were fixed with formalin for 24 h
4-mm-thick slices and 4-μm sections
Volume shrinkage before and after formalin fixation of 65–97 % (mean 85 ± 10 %)
3D reconstruction of the pathological volume and image co-registration
Head and neck
Studies involving laryngectomy specimens have been published by Daisne et al.  and by Caldas-Magalhaes et al. . There are both similarities and differences between the methods employed to preserve the shape of the excised larynx lesion. In both studies, the first step was to fix the specimen and to introduce rods as fiducial markers for aligning the slices of the specimen. However, the fixation procedures were different. Daisne et al.  fixed the specimen by placing it in a cast to which a gelatin solution was added, followed by refrigeration intervals at decreasing temperatures (down to −80 °C). This procedure was previously developed and validated against another fixation procedure (formalin), using an animal model. Caldas-Magalhaes et al.  fixed the specimen with 10 % formaldehyde for an extended period (at least 48 h) and embedded it in an agarose solution at controlled temperature, after which they cooled it at 5 °C until solidification. In both studies, after fixation the specimen was cut into slices a few millimeters thick (between 1.7 and 2 mm  and around 3 mm ).
Daisne et al.  studied macroscopic slices of the specimen, whereas Caldas-Magalhaes et al.  investigated them at a microscopic level, for which they added an additional step. After removal of the agarose and decalcification, the macroscopic slices were embedded in paraffin and one 4-µm-thick section was obtained for each 3-mm-thick slice and stained with hematoxylin–eosin (H&E). The authors reported that shrinkage of the specimens occurred mostly during the last step of the pathology processing and that there was shrinkage of 12 ± 3 % between the microscopic and macroscopic sections. The extent of the deformations and shrinkage that occurred during surgery and formaldehyde fixation of the specimens was small (3 ± 1 % inside the cartilage skeleton).
Daisne et al.  calculated that the loss of tissue during slicing of the specimen was close to one slice thickness per slice obtained. Caldas-Magalhaes et al.  measured a loss of 2 % for the whole specimen during slicing.
3D image registration
Caldas-Magalhaes et al.  detailed all the steps of the image co-registration process and reported the registration error associated with each step. The H&E slides were rigidly registered with the thick-slide photos, and a scaling factor was applied to these slides. The authors reported that the registration error between the pathology and the CT, MRI and PET images in the cartilage skeleton was on average 1.5, 3.0 and 3.3 mm, respectively. They found that the “GTV was a rigid and compact mass of tissue”, and “that it maintained its shape during the procedure”. They concluded that comparing the GTV delineated on the PET image with the GTV derived from the pathology specimen is feasible, with an average overall error below 3.5 mm inside the laryngeal skeleton. The delineation inaccuracies were larger than the registration error.
Non-small-cell lung cancer (NSCLC)
Lung tissue tends to collapse. To compensate for the resulting deformations, some studies involving NSCLC specimens inflated the lobes using different materials and methods. In the lung lobe processing method published by Stroom et al. , the lobes were inflated with formalin. Inflation was stopped when the lobes attained a volume as close as possible to the lobe volume seen on the CT. Wanet et al.  used gelatin to inflate the excised lung lobes until they were uniformly filled, and Dahele et al.  insufflated the specimens with 10 % formalin until the specimen was saturated and formalin was being exuded across the pleura.
However, in all three studies, significant deformations of the specimen from the in vivo state were found. Dahele et al.  reported that formalin was expelled from the lobes while they were being cut into macroscopic sections, resulting in additional deformation which needs to be accounted for. To overcome this problem, they embedded some of the specimens in agar before sectioning and tested several cutting methods. They reported that cutting with an electric rotor cutter improved the consistency of the sections and that embedding in agar was helpful in some of the cases.
Large deformations were observed between the CT and the macroscopic images of the specimen [27, 33], and were found to be anisotropic . A similar observation was made by Siedschlag et al.  for a 10-mm-thick layer around the GTV “depending on circularity of the tumor and orientation of the specimen on the pathology table during processing.” Gravity was expected to deform the specimens in a direction perpendicular to the table. Stroom et al.  mentioned that: “the volume of the well-inflated lung lobes on pathologic examination was still, on average, only 50 % of the lobe volume on CT.”
Other studies dealt with lung lobes but used a different technique, which did not involve inflating the specimen. For those studies, radiology–pathology image co-registration was not performed. Yu et al.  oriented the specimen to the in vivo geometry and bisected it in the transverse plane in the operating room. They took photographs of the specimen both before and after fixation in 10 % formalin, as well as after slicing the specimen with a microtome into 5- to 7-µm-thick slices, to determine the volume correction. They reported that, after fixation with formalin, the tumor volumes were reduced to 82 ± 10 % of the original volumes (range, 62–100 %). Meng et al. , in a follow-up study, fixed the specimens in formalin and subsequently sliced them to obtain whole-mount, H&E-stained slides, after which they examined the microscopic tumor extension (ME). They did not correct for shrinkage resulting from formalin fixation, even though they had measured it in their previous study , and they point out that it may affect the maximal ME (MEmax). Schaefer et al.  processed the specimens immediately after extraction, so formalin was not used and shrinkage was not considered. The specimens were sectioned into slices ranging from 4 to 5 mm in thickness and manual contouring of the macroscopic tumor extension area was performed for each slice. The accuracy of the technique was not reported.
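As a rough illustration of how a measured volume change could be turned into a correction, the sketch below converts the fraction of volume retained after fixation into corrected volumes and lengths. It assumes that shrinkage is isotropic, which is a simplification and not a claim made by the studies above. The function names are hypothetical; the value 0.82 is the mean retained fraction reported by Yu et al.

```python
def linear_factor(retained_fraction):
    """Linear scale factor implied by a retained volume fraction.

    Assumes isotropic shrinkage, so length scales as volume ** (1/3).
    """
    if not 0.0 < retained_fraction <= 1.0:
        raise ValueError("retained_fraction must be in (0, 1]")
    return retained_fraction ** (1.0 / 3.0)


def correct_volume(fixed_volume_ml, retained_fraction):
    """Estimate the pre-fixation volume from the volume measured after fixation."""
    if not 0.0 < retained_fraction <= 1.0:
        raise ValueError("retained_fraction must be in (0, 1]")
    return fixed_volume_ml / retained_fraction


def correct_length(fixed_length_mm, retained_fraction):
    """Estimate a pre-fixation length under the isotropic assumption."""
    return fixed_length_mm / linear_factor(retained_fraction)


if __name__ == "__main__":
    # Hypothetical specimen: 8.2 ml and 25 mm measured after fixation.
    print(correct_volume(8.2, 0.82))
    print(correct_length(25.0, 0.82))
```

Because the reported range of retained volume is wide (62–100 %), a per-specimen measurement would be preferable to a population mean wherever it is available.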
3D image registration
Stroom et al.  found that the CT-to-pathology deformation factors for their study were linear, anisotropic and ranged from 1.0 to 2.4 (average 1.8) over all three directions (Fig. 2). In this study, rigid corrections were applied to the pathology specimens to correlate them with pre-surgery scans. The authors also took the maximal ME for every patient and multiplied it by the deformation factors. Wanet et al.  rigidly registered the pathological volume with the CT and PET images. Dahele et al.  developed a method for 3D correlation of PET/CT images and whole-mount histopathology in NSCLC. They qualitatively described their experience in registering 3D PET/CT images with pathology and concluded that there “is no one definitive method for 3D volumetric” radiology–pathology correlation (RPC) in NSCLC and that using “large histopathology slides to whole-mount entire sections for digitization” allows rigid and manual registration of histopathology reconstructions to CT and PET. They also pointed out that “timing between imaging and surgery and the use of respiratory-correlated PET and CT imaging” will become factors for robust RPC .
Roels et al.  put each rectum specimen in a box immediately after extraction; wooden rods were placed inside and around the specimen for orientation and reconstruction purposes. The box was filled with a gelatin solution and stored at −20 °C for 2–3 days to freeze it. Slices with thickness of 2–3 mm were obtained and fixed in formaldehyde. Microscopically thin cross-sections of the tumor were then obtained and registered with the photos of the macroscopic specimen. Microscopic slices were corrected for the shrinkage that occurred during the fixation step and the GTV was delineated on these microscopic slides.
Zhang et al.  aimed at determining the optimal SUV cutoff for FDG PET scans of patients with cervical cancer by matching the volume measured on the extracted specimen to GTVPET. The pathology procedure included fixing the extracted specimen with a 10 % formalin solution, cutting it into serial slices of 4 mm thickness and embedding them in paraffin. The macroscopic slices were then cut into 4-μm-thick histological sections and stained with H&E. They measured the tumor volume before and after formalin fixation and reported volume shrinkage of 65–97 % (mean 85 ± 10 %).
In addition to the investigations mentioned above, there exist a large number of investigations in which the size or volume of the tumor was estimated from the surgical specimen without slicing it. The methods employed are discussed below.
In a head and neck study, Burri et al.  estimated the pathological tumor volumes from the maximal 3D lengths of each tumor after resection and compared them with the volumes measured on the PET and CT images. The 3D diameters were also measured and used to calculate the pathological ellipsoid volume by Sridhar et al.  for head and neck, lung, and colorectal tumors. Schinagl et al.  measured the volume of lymph node metastases from head and neck cancer using water immersion after removal of perinodal and fatty tissues.
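The ellipsoid estimate mentioned above can be sketched as follows. The text does not spell out the formula, so the sketch assumes the standard ellipsoid volume, V = (π/6)·d1·d2·d3, for three orthogonal diameters. The function name and units are illustrative.

```python
import math


def ellipsoid_volume_ml(d1_mm, d2_mm, d3_mm):
    """Volume (ml) of an ellipsoid from three orthogonal diameters (mm).

    Uses V = pi/6 * d1 * d2 * d3 and converts mm^3 to ml (1 ml = 1000 mm^3).
    """
    for d in (d1_mm, d2_mm, d3_mm):
        if d <= 0:
            raise ValueError("diameters must be positive")
    volume_mm3 = math.pi / 6.0 * d1_mm * d2_mm * d3_mm
    return volume_mm3 / 1000.0


if __name__ == "__main__":
    # Hypothetical tumor with maximal lengths of 30, 20 and 15 mm.
    print(round(ellipsoid_volume_ml(30.0, 20.0, 15.0), 2))
```

An estimate of this kind ignores irregular tumor shape, which is one reason such volumes are less detailed than those reconstructed from serial sections.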
The studies listed below compared one or more of the tumor dimensions of the surgical specimen from the patient with the corresponding lengths observed on CT, PET and in several cases also MRI scans. For esophageal squamous cell carcinoma, Zhong et al.  measured the gross tumor length on PET images and compared it with the tumor length as measured from the pathology specimen. The length of the esophagus was measured in vivo before removal. They corrected for the deformation of the surgical specimen by stretching it to the length as measured in vivo. The gross tumor length was then measured. Han et al.  followed this procedure for esophageal squamous cell carcinoma, but they added a fixation step of the specimen with 10 % formaldehyde. They then cut 0.5-cm-width tissue strips, and measured the longitudinal tumor length. They did not report correcting for shrinkage after fixation. The gross tumor lengths were compared to the lengths derived from FDG- and fluorothymidine- (FLT) PET images. To the best of our knowledge, this is the only PSPV investigation that also included a radiotracer other than FDG.
For NSCLC lesions, van Baardwijk et al.  measured the maximal diameter (MD) by macroscopic examination of extracted lung tumors. Wu et al.  inflated and fixed the lung lobes for 12–24 h in 10 % neutral-buffered formalin. They measured the MD by macroscopic examination of sections of the specimens obtained at 3- to 5-mm intervals.
For rectal cancer, Buijsen et al.  measured the length of the tumor macroscopically with a ruler before slicing. Chen et al.  investigated maximum tumor diameters in colon and sigmoid cancer. Measurements on the specimen were performed after fixation in formalin and prior to slicing. They mentioned that they did not correct for possible shrinkage due to fixation in formalin.
Role of the data sets in tumor target determination
Since the focus of the present review is the role of pathology-determined tumor borders in the segmentation of PET images, comparison with CT- and MRI-determined tumor volumes is mentioned only where it relates to PET volumes. We group the results into three categories: comparisons of tumor volume sizes, evaluations of the accuracy of segmentation tools, and findings regarding the location of the tumor extensions with respect to the segmented volumes. The results obtained using the data sets in each of these categories are briefly summarized below, separately for each body location, both from the reviewed articles and from subsequent investigations using the respective data sets (Table 3). To fully evaluate the value of the findings summarized below, the reader should consider the limitations of the specimen handling and image registration procedures (where applicable) described in detail in the original articles.
As a general rule, the automatic PET segmentation tools including those discussed below should not be directly used in the clinic. Mistreatment could occur due to large variations between clinics and patients. Validation of the segmentation tools for each particular PET/CT scanner, scanning protocol, body location, disease type as well as careful review and editing of the tumor contours by an experienced physician for each patient are needed.
Head and neck lesions
Tumor volume comparisons
The main findings for laryngeal squamous cell carcinoma (LSCC) published by Daisne et al.  were that the GTVs were significantly smaller when determined from the surgical specimen than when determined from CT, MRI and FDG PET. At the same time, the macroscopic tumor extensions were not completely covered by any of the three imaging modalities. Caldas-Magalhaes et al.  reached similar conclusions, since they also found that the average GTVs determined from CT, MRI and PET (GTVCT, GTVMRI and GTVPET) were all larger than the average GTV determined from pathology (GTVpath) and that GTVPET was the closest to GTVpath, but that CT and MRI provided better tumor coverage. Burri et al.  also reported that the tumor volumes they measured on PET images were generally smaller than those measured on CT.
Evaluation of PET segmentation methods
Types of PET segmentation methods tested against histopathology for different tumor types or body locations
Tumor type or location
Types of PET segmentation methods tested against pathology
Head and neck
Possibility theory 
Multimodality using level sets 
Non-small-cell lung cancer (NSCLC)
Fuzzy C-means 
Active contours 
Neural network 
Multimodality using level sets 
Colon, rectal and sigmoid cancer
Geets et al.  used seven cases of the Louvain LSCC data set  to test the validity of a gradient-based segmentation method. They found that, when applied to denoised and deblurred images, this gradient-based method was more accurate than the SBR method also used above , although it did not totally cover the macroscopic tumor volume. Belhassen et al.  used the same set of seven cases  to compare the performance of three fuzzy C-means (FCM) clustering algorithms and found that incorporating the à trous wavelet transform to improve accuracy for heterogeneous cases resulted in more accurate delineation. These authors also reported that all three techniques failed to fully encompass the macroscopic tumor volumes. Abdoli et al.  also used the Louvain LSCC data set  to evaluate a contourlet-based active contour PET-AS tool aimed at accounting for the noise and heterogeneity of PET images and found it to be superior to adaptive threshold and two FCM methods. Zaidi et al.  used the Louvain LSCC data set  to compare the performance of nine algorithms, including five threshold methods, a level set method, a stochastic expectation–maximization method, fuzzy clustering-based segmentation (FCM) and a spatial wavelet-based FCM (FCM-SW), and found FCM-SW to be the most accurate. Markel et al.  also used the Louvain LSCC data set  to evaluate a multimodality segmentation tool using level sets and Jensen-Renyi divergence (JRD). They compared the results to those from Zaidi et al.  and found that the JRD approach was second to the FCM-SW method.
A possibility theory-based PET-AS tool, the 42 % threshold, and two adaptive threshold methods [46, 52] were tested and compared for the LSCC data set  by Dewalle-Vignon et al. . The authors demonstrated the “validity” of their possibility theory approach, which was developed to account for the inherent uncertainty and accuracy in the PET images, with respect to the other methods tested, but remarked that the method “does not globally result in superior results to that of some adaptive thresholding.”
SUV thresholds were tested for head and neck lesions by Burri et al.  and Schinagl et al. . Burri et al.  determined the pathology volume from “the maximal tridimensional lengths of each tumor” and found that the default SUV threshold of their software and narrowing the SUV “window” by one standard deviation were most likely to underestimate the tumor volume, while an SUV of 2.5 was most likely to overestimate it, and that a threshold at “40 % or greater maximum” “appears to offer the best compromise between accuracy and reducing the risk of underestimating tumor extent.” Schinagl et al.  compared several PET-AS tools (SUV = 2.5, two fixed thresholds and two adaptive thresholds) to volumes of lymph node metastases from head and neck cancer and found that the last four tools performed worse if the primary tumor was used as a reference. They did not see an advantage to adding PET for lymph node segmentation, but did recommend using a PET-AS tool for improving reproducibility and comparison between institutions for therapy planning and assessment.
Evaluation of PET segmentation methods
Zhong et al.  found “that the optimal PET method to estimate the length of gross tumor varies with tumor length and SUVmax; an SUV cutoff of 2.5 provided the closest estimation in this study,” when compared to visual interpretation and 40 % of maximum SUV.
Han et al.  segmented their FLT PET images using visual delineation and several thresholds (SUV cutoffs of 1.3, 1.4, 1.5, and taking 20, 25 and 30 % of the SUVmax). For their FDG PET image segmentation they used: visual delineation, SUV 2.5, and 40 % of SUVmax. They used the same specimen stretching procedure as Zhong et al.  and found that an SUV cutoff of 1.4 for FLT PET and 2.5 for FDG PET gave GTV lengths closest to pathology.
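The two kinds of cutoff compared in these esophageal studies, an absolute SUV value and a percentage of SUVmax, can be illustrated on a one-dimensional SUV profile sampled along the tumor axis. This is a minimal sketch under stated assumptions: the profile, the sample spacing and the function names are hypothetical, and length is taken simply as the span between the first and last samples at or above the cutoff, which may differ from how the cited studies measured it.

```python
def length_above_cutoff(profile, spacing_mm, cutoff):
    """Length (mm) spanned from the first to the last sample >= cutoff."""
    indices = [i for i, value in enumerate(profile) if value >= cutoff]
    if not indices:
        return 0.0
    return (indices[-1] - indices[0] + 1) * spacing_mm


def length_absolute(profile, spacing_mm, suv_cutoff=2.5):
    """Tumor length using an absolute SUV cutoff (e.g. 2.5 for FDG)."""
    return length_above_cutoff(profile, spacing_mm, suv_cutoff)


def length_percent_max(profile, spacing_mm, fraction=0.40):
    """Tumor length using a cutoff set to a fraction of SUVmax (e.g. 40 %)."""
    if not profile:
        return 0.0
    return length_above_cutoff(profile, spacing_mm, fraction * max(profile))


if __name__ == "__main__":
    # Hypothetical SUV samples taken every 4 mm along the esophagus.
    profile = [1.0, 1.8, 2.6, 4.0, 8.0, 5.0, 3.0, 2.0, 1.2]
    print(length_absolute(profile, 4.0))
    print(length_percent_max(profile, 4.0))
```

For this hypothetical profile the absolute cutoff gives a longer estimate than the percentage cutoff, because the percentage cutoff rises with SUVmax. This is consistent with the observation that the best-performing method can depend on SUVmax.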
Tumor volume comparisons
As in the investigations of head and neck tumors, Schaefer et al.  reported for NSCLC that both CT and PET overestimated the pathological tumor volume and that the PET volume was closer to it. Interestingly, GTVpath was less than GTVPET for all the patients included in that study. They also found significant differences between the PET and pathology volumes in the lower lobe, but not so for the upper lobe. Wanet et al.  also found that FDG PET provided an average volume that was closer to GTVpath, when compared to CT, but not for all patients. In four of the patients studied by Stroom et al. , GTVPET values were 13, 7, 7, and 24 ml, while GTVpath values were 6, 4, 8, and 39 ml, respectively.
Evaluation of PET segmentation methods
Most of the original lung studies evaluated threshold and adaptive threshold methods against NSCLC pathology volumes. Schaefer et al.  found a correlation of an adaptive threshold algorithm (which uses the mean SUV above 70 % of SUVmax and background as parameters) with the pathology findings. Yu et al.  performed a search to identify an SUV that would result in the best match for GTVpath. They found that “The mean (±SD) %SUV and absolute SUV that produced the best agreement between GTVpath and GTVPET were 31 ± 11 % and 3.0 ± 1.6, respectively.” In addition, they found that “the optimal threshold was inversely correlated with GTVpath or tumor diameter.”
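A threshold search of the kind described above can be sketched as a simple scan over candidate percentages of SUVmax, keeping the one whose thresholded volume is closest to GTVpath. This is an illustrative sketch, not the procedure of any cited study: the flat list of voxel SUVs, the voxel volume, the candidate range and the function names are all assumptions.

```python
def volume_at_threshold(suvs, voxel_volume_ml, threshold):
    """Volume (ml) of all voxels with SUV at or above the threshold."""
    return sum(1 for value in suvs if value >= threshold) * voxel_volume_ml


def best_percent_threshold(suvs, voxel_volume_ml, gtv_path_ml,
                           percents=range(10, 91)):
    """Return (percent, volume_ml) whose volume best matches gtv_path_ml.

    Several thresholds can give the same volume; the lowest one is returned.
    """
    if not suvs:
        raise ValueError("suvs must not be empty")
    suv_max = max(suvs)
    best = None
    for pct in percents:
        volume = volume_at_threshold(suvs, voxel_volume_ml,
                                     pct / 100.0 * suv_max)
        error = abs(volume - gtv_path_ml)
        if best is None or error < best[1]:
            best = (pct, error, volume)
    return best[0], best[2]


if __name__ == "__main__":
    # Hypothetical lesion: six voxels of 1 ml each, pathology volume 3 ml.
    print(best_percent_threshold([10, 8, 6, 4, 2, 1], 1.0, 3.0))
```

Matching volume alone does not guarantee that the thresholded region overlaps the pathology region, so an overlap measure is a useful complement.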
Wanet et al.  evaluated gradient-based, adaptive threshold and fixed threshold PET-AS methods and found that a gradient-based method outperformed threshold-based techniques and also that there was “no statistical difference between the different imaging modalities and delineation methods” by performing volume matching analysis using the Dice similarity coefficient. Abdoli et al.  also used nine patients from the Louvain lung case data  to evaluate their active contour PET-AS tool and found it to be superior to the other methods they tested, as they also found for the laryngeal cases above.
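The Dice similarity coefficient used for such volume matching is defined as twice the size of the overlap divided by the sum of the two volume sizes. A minimal sketch follows, representing each contour as a set of voxel indices. The representation, the function name and the convention of returning 1.0 for two empty sets are assumptions made for illustration.

```python
def dice(voxels_a, voxels_b):
    """Dice similarity coefficient, 2|A ∩ B| / (|A| + |B|), of two voxel sets."""
    a, b = set(voxels_a), set(voxels_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty contours agree fully
    return 2.0 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    # Hypothetical PET and pathology contours on a small voxel grid.
    gtv_pet = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
    gtv_path = {(0, 0, 1), (0, 1, 1), (1, 1, 1)}
    print(dice(gtv_pet, gtv_path))
```

A value of 1 means identical contours and 0 means no overlap. Unlike a pure volume comparison, the coefficient penalizes a segmented volume that has the right size but the wrong location.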
The MAASTRO NSCLC data set  was used by several research groups. Van Baardwijk et al.  evaluated an automatic SBR-based PET-AS method and showed it to result in good correlation with pathology measurements and in reduction of the inter-observer variability. This data set was also used by Hatt et al.  to study the impact of tumor size and heterogeneity on the delineated volume. They found that the Fuzzy Locally Adaptive Bayesian (FLAB) algorithm (designed to account for image uncertainty due to noise as well as image blurring due to limited resolution) gave results closer to pathology than the 50 % of the maximum PET intensity threshold, T50  and an adaptive threshold method . They also found that for more heterogeneous tumors the threshold-based techniques more strongly underestimated the tumor volumes and suggested that such methods should not be used for large heterogeneous NSCLC 18F-FDG PET images.
The same data set  was also used by Belhassen et al.  to test the three FCM clustering algorithms, which they also tested against the laryngeal lesions (above). They found that the wavelet transform-enhanced FCM resulted in a smaller mean error of the maximal diameter estimation also for the NSCLC lesions. Markel et al.  also used the MAASTRO NSCLC data set  to evaluate their multimodality segmentation tool using level sets and JRD and found that JRD outperformed an SBR method when using only PET and noted further performance improvement when information from both PET and CT is used. Sharif et al.  used the MAASTRO data set  to evaluate an artificial neural network approach.
Wu et al.  automatically contoured GTVs on PET images at 20, 30, 40, 45, 50, and 55 % of the maximal intensity level. They found that GTVCT correlated better with pathology than GTVPET and that one of their CT window and level settings and a PET threshold of 50 % of the maximum level “had the best correlation with pathologic results.”
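Fixed-threshold contouring of the kind described above can be sketched in a few lines. The example below is an illustration under simplifying assumptions, not the procedure used in any of the cited studies: the image is assumed to be already cropped to a region containing only the lesion, every voxel at or above the cutoff is counted (no connectivity check), and the function name `threshold_volumes` and the toy image are hypothetical.

```python
import numpy as np


def threshold_volumes(pet, voxel_volume_ml, percents=(20, 30, 40, 45, 50, 55)):
    """Volume (ml) of the region at or above each percentage of the maximum intensity."""
    pet = np.asarray(pet, dtype=float)
    peak = pet.max()
    volumes = {}
    for p in percents:
        cutoff = p / 100.0 * peak
        # Assumption: the array holds a single lesion, so no connectivity check is made.
        volumes[p] = float((pet >= cutoff).sum()) * voxel_volume_ml
    return volumes


# Toy image: a 3 x 3 x 3 block of moderate uptake with one hot central voxel.
pet = np.zeros((5, 5, 5))
pet[1:4, 1:4, 1:4] = 4.2
pet[2, 2, 2] = 10.0
voxel_volume_ml = 0.064  # 4 x 4 x 4 mm voxels

for percent, volume in threshold_volumes(pet, voxel_volume_ml).items():
    print(f"T{percent}: {volume:.3f} ml")
```

The segmented volume can only shrink or stay constant as the threshold percentage rises, which is why the choice of threshold has a direct effect on the agreement with the pathology volume.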
Microscopic tumor extensions
Few of the studies reported ME findings. Stroom et al.  found that MEmax, defined as the maximum of the minimum distances from the GTV to each ME islet for each patient, varied between 0 and 9 mm before deformation correction (average 5 mm) with an average of 9 mm after the correction. A follow-up of this study published by van Loon et al. , using the same specimen processing and registration procedures, further examined ME for NSCLC and found an association of mean CT tumor density and GTVCT with the presence of ME. Using a statistical model, they divided the patients into two groups with high and low probability of ME and found that the mean CT number and GTVCT are significant predictors of ME presence. They also found that GTVPET (automatically delineated using a 42 % threshold of the maximum SUV) as well as GTVCT accurately represent the Clinical Target Volume determined from pathology, CTVpath, for patients with low risk of ME, but that both GTVCT and GTVPET underestimate CTVpath for patients with high risk of ME, on average by 19.2 and 26.7 mm, respectively. Meng et al.  determined the maximal ME from all islets for each patient without considering direction. They found that MEmax was significantly correlated with SUVmax and the metabolic tumor volume (MTV). To cover 95 % of ME, they suggested margins varying between 1.93 and 9.60 mm depending on SUVmax.
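The MEmax definition quoted above for Stroom et al. (the maximum, over all ME islets of a patient, of the minimum distance from the GTV to each islet) can be written compactly. The sketch below assumes that the GTV boundary and the islets are each represented by point coordinates in millimeters in a common frame, and that a patient without ME islets is assigned MEmax = 0. The function name `me_max` and the coordinates are illustrative only.

```python
import numpy as np


def me_max(gtv_points_mm, islet_points_mm):
    """Maximum over islets of the minimum islet-to-GTV distance (mm)."""
    gtv = np.asarray(gtv_points_mm, dtype=float)
    islets = np.asarray(islet_points_mm, dtype=float)
    if islets.size == 0:
        # Assumption: no microscopic extension found means MEmax = 0.
        return 0.0
    # Pairwise distances, shape (number of islets, number of GTV points).
    distances = np.linalg.norm(islets[:, None, :] - gtv[None, :, :], axis=2)
    nearest_gtv_distance = distances.min(axis=1)  # one value per islet
    return float(nearest_gtv_distance.max())


# Toy example: two GTV boundary points and three islets.
gtv_points = [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
islets = [[0.0, 3.0, 0.0], [10.0, 0.0, 4.0], [5.0, 0.0, 0.0]]
print(me_max(gtv_points, islets))  # nearest distances are 3, 4 and 5 mm -> 5.0
```

In practice the GTV surface would be sampled much more densely than in this toy example, since the result depends on how well the point set represents the boundary.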
Colon, rectal and sigmoid cancer
In rectal cancer, Roels et al.  compared the closeness of GTVPET (obtained with adaptive threshold and gradient-based segmentation methods) and GTVMRI to GTVpath. They found that GTVPET obtained with the gradient-based segmentation was closer to GTVpath than GTVMRI or GTVPET obtained with the adaptive threshold method. They also reported a spatial discordance between MRI- and PET-based tumor volumes of approximately 50 %, which could be in part related to rectal filling with MRI contrast.
Buijsen et al.  found that rectal tumor lengths determined by an SBR-based PET-AS method show the strongest correlation with lengths measured on pathology, compared with tumor lengths determined from the CT and MRI images. Chen et al.  tested segmentation thresholds at 20, 30, 40 and 50 % of SUVmax and found that a 30 % threshold of the PET maximum uptake provides an adequate tumor length and width for tumors in the colon and in the sigmoid.
Sridhar et al.  tested several threshold methods and a gradient segmentation method for segmenting head and neck, lung and colorectal tumors and found the gradient method to have “superior correlation and reliability with the estimated ellipsoid pathologic volume.”
Zhang et al.  searched for optimal segmentation thresholds and found that for their 10 cervical cancer patients the optimal percent and absolute SUV thresholds were 40.50 ± 3.16 % and 7.45 ± 1.10, respectively. They also found that the optimal percent SUV threshold was inversely correlated with GTVpath and tumor diameter and that the SUV threshold was positively correlated with SUVmax.
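A search for an optimal threshold of this kind can be illustrated as follows. The criterion used here, choosing the percentage of SUVmax whose segmented volume is closest to the pathology volume, is an assumption made for illustration and is not confirmed as the criterion used by Zhang et al. The function name `best_percent_threshold` and the toy data are hypothetical.

```python
import numpy as np


def best_percent_threshold(pet_suv, voxel_volume_ml, gtv_path_ml, percents=range(10, 91)):
    """Return (percent of SUVmax, absolute SUV cutoff) best matching the pathology volume."""
    pet = np.asarray(pet_suv, dtype=float)
    suv_max = pet.max()
    best = None
    for p in percents:
        cutoff = p / 100.0 * suv_max
        volume_ml = (pet >= cutoff).sum() * voxel_volume_ml
        error = abs(volume_ml - gtv_path_ml)
        # Keep the first (lowest) percentage that gives the smallest volume error.
        if best is None or error < best[2]:
            best = (float(p), float(cutoff), error)
    return best[0], best[1]


# Toy lesion: ten voxels of 1 ml each with SUVs from 1 to 10; pathology volume 4 ml.
suv = np.arange(1.0, 11.0)
percent, absolute = best_percent_threshold(suv, voxel_volume_ml=1.0, gtv_path_ml=4.0)
print(percent, absolute)
```

The percent and absolute thresholds are linked through SUVmax (absolute cutoff = percent / 100 × SUVmax), which is consistent with the reported positive correlation between the absolute SUV threshold and SUVmax.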
Discussion and conclusions
Summary and critical analysis of the literature
Using histopathology results of excised lesions to validate PET images is challenging. Therefore, efforts in this direction, including the papers summarized in this review, provide indispensable data toward solving the dilemma of how PET images should be used to define the tumor volume. While all these investigations contribute toward finding a solution to the problem, the most valuable are those that manage to provide an estimate of the 3D shape of the lesion based on pathology, since in addition to providing information about the tumor volume or diameter they may also locate the border of the lesion in the PET image. This was achieved through fixation of the specimen through freezing , inflation [27, 28] and/or placement in formalin  followed by corrections for tissue retraction and/or deformation. Despite these meticulous efforts to preserve the shape of the lesion after excision, fixation and slicing, the accuracy of the respective corrections for shrinkage and deformation and their effect on the validation accuracy have been investigated in only a few studies [25, 26, 27, 28, 29, 35]. H&E staining was used in the studies in which microscopic histopathology analysis was performed.
Since the volumetric studies require less processing of the specimen, they present the possibility of having a larger number of patients and thus better statistics. These studies, however, do not provide sufficient information for strict evaluation of the segmentation methods, since as reported by Daisne et al. , even if similar in size, the GTV from the PET image may not overlap with the pathology volume.
All but one of the investigations reviewed in this paper used only 18F-labeled FDG; the exception (Han et al. ) investigated both FDG and FLT PET. Also, most of the studies considered only GTVpath [25, 33, 34, 38, 56], but a few also evaluated the PET-derived GTV against both the GTVpath and the CTVpath [27, 28]. While the additional comparison with the CTVpath further complicates the investigation, it is very valuable in providing the CTV tumor margin, which in addition to being disease- and location-dependent may also be anisotropic.
The summarized studies also differ in how, and how much, GTVpath was used for evaluating various PET segmentation approaches. An upcoming review which lists the segmentation tools evaluated against PSPVs will be presented in the first TG211 report . The majority of the pathology-validated PET image data sets reviewed here were originally presented by their authors in conjunction with some segmentation contours, although evaluating the contouring method may not have been their primary goal. The segmentation tools evaluated against the pathology in the original publications were mostly simple threshold or adaptive threshold methods. More advanced segmentation methods, which promise to be able to handle realistic tumors with irregular shape and non-uniform activity, have been evaluated in later publications against some of the PSPVs reviewed here [47, 48, 49, 50, 51, 53, 54, 55]. Important conclusions have been reached for these more advanced methods, as pointed out in the previous section. At the same time, the TG211 report  points out about PSPVs that “several sources of error in the production of these data sets should be acknowledged: (1) deformation of the surgical specimen after excision, (2) time difference between the PET scan and the specimen excision, (3) imperfect delineation of metabolic boundaries in digitized histopathology, and (4) imperfect co-registration between histopathology and PET image spaces.”
As pointed out above, a few of the laryngeal and NSCLC investigations made the interesting observation that the deformation of the excised lesion can be neglected. Due to insufficiency of the data provided it is difficult to assess the meaning and accuracy of this statement, especially when observing the difference in lesion shape between the macro-specimens and the CT in Figs. 1 and 2. Probably what was meant was that the deformations of the lesions were much smaller than those of the surrounding soft tissues. As pointed out by many of the investigators, deformations both during fixation and slicing are possible.
Possible directions for improvement are to increase registration accuracy and reduce the time between patient scan and lesion excision. Applying correction factors for changes in the specimen during the fixation process  is necessary, although for some (e.g., laryngeal, cortical bone) specimens these changes may be small or negligible. Providing an estimate of the accuracy of the deformation correction factors is also desirable to verify that the level of accuracy is sufficient for evaluating PET segmentation methods. Free-breathing, non-gated PET scans were acquired in most studies except one , where the PET scan was gated. In some cases, the time between PET and surgery was long enough (up to 3 weeks) to expect tumor changes.
Due to these potential sources of error, as well as the difficulty in accumulating more PSPVs, many of the PET-AS methods have also been tested against expert delineation or images from simulated and experimental phantoms as described in several reviews [9, 14, 15]. These reviews also list other very promising and advanced methods, which, to our knowledge, have not yet been tested against histopathology-based ground truth. This cannot be considered as a disadvantage of such PET-AS methods, bearing in mind the more accurate registration of the ground truth with PET images for experimental and numerical phantoms. Despite this, given the many factors that can affect a clinical image and may not be exactly represented in the simulations (e.g., biological uncertainty and image noise), testing the segmentation methods on clinical images with some type of pathology-based ground truth is highly desirable. As pointed out in the upcoming TG211 report, PET segmentation should ideally be evaluated against a combination of phantom and clinical images with reliable ground truth in a standardized way.
When considering the results from evaluation of various PET segmentation methods, it is very important to consider the PET scanners and protocols used in the different studies, since they may significantly affect the PET image and therefore the segmentation results. These differences between the scanners, protocols and procedures used by different institutions should be investigated. Each user should test the accuracy of the segmentation algorithm and adapt the method to his/her particular setting before using that algorithm for radiotherapy planning . In general, validation of PET contours against the pathology-defined ground truth aims to resolve modifications of the PET image due to both physical artifacts and biological phenomena. However, since the physical artifacts are scanner- and protocol-dependent and the biological phenomena are patient-dependent, translating the results from published pathology validations to different patients in different institutions remains a challenge. Current efforts to standardize imaging protocols [7, 58, 59], as well as the work of several task groups (e.g., AAPM TG 174), will reduce differences between PET images due to physical factors, but this does not address patient-specific biological variations.
Despite the significant contribution of the investigations providing PSPVs, their number remains small and insufficient due to the substantial experimental burden and difficulties in producing pathology-based definitions of lesions and in registration of the pathology-derived ground truth with the PET image. The problem is compounded by the fact that variation of tumor type, stage and location in the body often results in large variations in the level and heterogeneity of PET tracer uptake in the tumor and in the surrounding healthy tissues. In addition, the recently observed heterogeneity of genetic mutations  introduces practically infinite degrees of freedom for the tumor genetic identity, which may also manifest in different metabolic representation. Therefore, continuing these efforts may be strongly affected by confirmation of the hypothesis that cancers of the same type have a common metabolic representation. More data of the kind summarized in this review, but with the addition of the extra dimension of tumor genetic mutations, will need to be accumulated to address this hypothesis .
In a recently published study, Axente et al.  proposed another alternative for generating pathological data sets for PET segmentation validation. They tested their approach in a small animal model. It consisted of injecting a mouse with 14C-FDG and sacrificing it 80 min post injection. The tumor was then extracted and sliced. An autoradiography of the slices was acquired to image the activity distribution in the tumor, and a 3D reconstruction of the radiotracer distribution was performed. A PET scan was simulated based on the tracer uptake distribution. This method appears very promising to improve the accuracy of the pathology-based ground truth in the PET images, since the registration error was found to be very low.
Another opportunity for accumulating such data is to correlate the histopathology of the biopsy specimen obtained under PET/CT-guided biopsies  with the PET image. Such investigations would have the advantage of high spatial accuracy due to the visibility of the biopsy needle in the PET/CT image. In addition, performing autoradiography of the biopsy specimen provides an opportunity to determine the tracer distribution with higher spatial resolution than PET . The data which can be obtained by such studies are limited to the point of biopsy needle insertion. This, however, may be partly compensated for by the large number of biopsy procedures performed and their routine use in oncology. Correlations of the specimen histopathology with the PET/CT image obtained during a biopsy procedure in the operating room would be spatially more accurate than current investigations. However, even for specimens extracted under CT guidance, correlations with patients’ PET scans prior to the biopsy might also provide useful data, albeit with less spatial accuracy.
If the hypothesis described above is resolved and sufficient PET-histopathology correlations are accumulated for different tumor types, this may allow for more reliable definition of the lesion border from the PET image for localized therapies.
The authors would like to acknowledge the contribution of Dr Ellen Yorke, Ph.D. of the Department of Medical Physics at Memorial Sloan-Kettering Cancer Center (MSKCC) in New York, of Dr Heiko Schöder, M.D. Department of Radiology, MSKCC, of Dr. Mathieu Hatt, INSERM UMR 1101, LaTIM, Brest, France and of Dr. Andre Moreira, Department of Pathology, MSKCC, for their helpful comments on the manuscript. The authors acknowledge the support of the Department of Medical Physics at MSKCC and of Biospace Lab, S.A.
Conflict of interest
Dr Assen Kirov has a research grant from Biospace Lab, S.A., which partially supports the work of Ms Louise Fanchon.
Human and Animal Studies
This article does not contain any studies with human or animal subjects performed by any of the authors.