Introduction

Multiple images (e.g. X-rays) are often collected in epidemiological, medical, and other kinds of research in subjects over time or between conditions. Often the goal is to determine when and where statistically significant differences occur between conditions. Before evaluating changes between images, all images are co-registered both within and between subjects, e.g. to a single template [14]. Then, a statistical map [57] is created, consisting of t- (2 conditions) or F-statistics (>2 conditions) at each image unit, typically a pixel (2-dimensional) or voxel (3-dimensional). While direct comparison of the t-maps or F-maps, i.e. of changes of individual pixels, is possible using statistical parametric mapping (SPM) [8], such comparison may suffer from low statistical power after proper adjustment for multiple comparisons with family-wise error. More importantly, those pixels (or voxels) with significant changes can be distributed sparsely or be clinically or biologically irrelevant to a given application. Instead, a cluster of contiguous pixels or voxels is usually more informative and robust, in particular for the study of bone changes due to altered weight bearing conditions [9].

As an alternative, a suprathreshold cluster analysis (STCA) [10] determines the statistical significance of clusters with changes beyond a suprathreshold. STCA includes the following steps: First, it selects regions of interest (ROI), which are clusters of contiguous pixels with t- or F-values usually above the 95th percentile of the empirical distribution of the observed t- or F-statistics. Second, it uses permutation tests and selected cluster features (e.g. “size”) [5] to determine the family wise statistical significance of candidate clusters (ROIs). STCA has been successfully applied in neurological studies [1113]. The main advantages of STCA over SPM are to avoid selecting isolated pixels due to extreme values and to reduce the number of comparisons from pixels to ROIs. By determining the ROIs, the chosen threshold heavily influences the overall conclusions [13, 14]. STCA still uses a fixed “primary” threshold to construct clusters and then determines their significance. The consistency of current method in selecting these thresholds, however, is not ideal [15]. Smith and Nichols [16] proposed an alternative to a fixed threshold approach using the average of p values from all possible thresholds according to their distributional weights as the summary p values [16]. However, their method is beyond scope of this paper, and debates about selecting thresholds and testing for differences continue [13, 14, 16].

The aims of our paper are to propose a new cross-validation (CV) method to select the optimal threshold for forming candidate clusters, and to assess the significance of those clusters via a permutation test. We demonstrate our new method with an application to a study of accelerated bone loss of astronauts during long-term spaceflight.

The paper is organized in the following way. First, we provide background to the astronaut pre- and post-spaceflight study that motivated this project (Sect. 2). Using the terminology set in Sect. 2, we present detailed descriptions of our theory and methods (Sect. 3), which we then apply to the astronaut data (Sect. 4). We present our conclusions and discussion in the final section.

A study of bone loss during long-duration spaceflight

A longitudinal study of bone loss

Our research was motivated by a study of accelerated bone loss of astronauts during spaceflight [17, 18]. Astronauts experience localized bone mineral loss during extended periods of weightlessness, for example in the proximal femur [18]. Methodologies for detecting regions that experience greatest bone loss due to spaceflight may inform the study of changes in bone density due to long-term physical inactivity, aging, disease, drug treatment, and other causes. We present this study first to provide details of SCTA in order to understand our improvements in Sect. 3.

Quantitative computed tomography (QCT) scans of the hip were taken for 16 astronauts (44.6 ± 4.0 years old) prior to and after their 4–6 months spaceflight on the international space station (ISS). The study protocol was approved by the institutional review boards (IRBs) of the national aeronautics and space administration (NASA), Baylor College of Medicine, and the University of California, San Francisco (UCSF). Pre-flight scans were performed 30–60 days prior to launch, and post-flight scans were performed within 7–10 days of landing. Helical CT images (GE Hispeed Advantage GE Medical System, Milwaukee, WI) were acquired at Methodist Hospital, Baylor College of Medicine, at a scan setting of 80 KVp, 280 mAs, 3-mm slice thickness, helical pitch of 1, and in-plane spatial resolution of 0.9375 mm. The pre- and post-flight scans of the 16 astronauts were co-registered (including rigid and non-rigid co-registration) to a common reference space so that the homologous tissue elements could be compared [19, 20]. After image co-registration, one middle coronal slice with 114 × 151 = 17,214 pixels in each scan was used for this study. Bone mineral density (BMD) was measured in Hounsfield Units (HU, a quantitative measure of radiodensity) under pre-flight (A) and post-flight (B) conditions, and then the matched pixel differences were compared between the two conditions. For this study, our analysis focused on the region of the proximal femur, which consisted of 3,948 pixels. Though we only have access to 2-D data, the methods described in this paper can be equally extended to 3-D voxel data.

Generation of space-flight statistical parametric maps

Let A i (k, l) and B i (k, l) be the pre- and post-flight BMD measured in HU, respectively, at pixel coordinate (k, l) for astronaut i (i = 1, 2, …, I, I = 16). In this study we assume that only bone loss (not gain) occurs during spaceflight and therefore only consider one-sided differences A–B. Hence, a difference image or difference map between pre- and post-flight scans is computed for individual astronauts at each pixel (k, l) as

$$ \Updelta_{i} (k,\,l) = A_{i} (k,\,l) - B_{i} (k,\,l)\left( {i = 1, \cdots ,16} \right). $$
(2.1)

General formulas for the mean and variance of difference images are the following:

$$ \bar{\Updelta }(k,\,l) = \frac{1}{I}\sum\limits_{i = 1}^{I} {\Updelta_{i} (k,\,l)} , $$
(2.2)

and

$$ S^{2} (k,\,l) = {\frac{1}{I - 1}}\sum\limits_{i = 1}^{I} {\left( {\Updelta_{i} (k,\,l) - \bar{\Updelta }(k,\,l)} \right)^{2} } $$
(2.3)

where I is the total number of astronauts. With the assumption that the true error variance is spatially smooth, we use a smoothed variance estimator, S ′2, based on the Gaussian kernel of full width at half-maximum (FWHM 1.5 pixel) to decrease the noise in variance estimation [21].

Our objective function here is the mean difference of two conditions with mean statistics or t-statistics. Using these smoothed variance estimates, we calculate pixel-level pseudo t-statistics as [10, 21]

$$ T(k,\,l) = {{\bar{\Updelta }(k,\,l)} \mathord{\left/ {\vphantom {{\bar{\Updelta }(k,\,l)} {\sqrt {S_{{}}^{\prime 2} (k,\,l)} }}} \right. \kern-\nulldelimiterspace} {\sqrt {S_{{}}^{\prime 2} (k,\,l)} }}. $$
(2.4)

While the pseudo t-statistics used in this paper as an index of change are not independent between pixels, the conventional approach approximates them to t-statistics with degrees of freedom of I −1. In the remainder of the paper we drop the “pseudo” qualifier and denote T as the range of t-statistics in T 0.

Optimum suprathreshold selection

Cluster forming

A cluster is a set of spatially connected pixels sharing similar features and based on a t-map of I subjects. We consider a cluster as a set of connected pixels with {(k, l): T(k,l) ≥ u}, where u is a certain threshold. The connected neighbour region of pixel (k, l) is \( \{ (k - 1,l - 1), \, (k,l - 1), \, (k + 1,l - 1), \, (k - 1,l), \, (k + 1,l), \, (k - 1,l + 1), \, (k,l + 1),{\text{and }}(k + 1,l + 1)\} \).

Many discrete clusters can be formed within a t-map, even including a single isolated pixel. By altering the threshold u, we change the number and size distribution of clusters identified in a t-map. By determining which clusters become candidates for significance testing, threshold selection can have a strong influence on the results of any image analysis.

Cross-validation

Researchers face a constant challenge in trying to identify valid thresholds for constructing candidate clusters [10, 22]. Although a common approach arbitrarily uses the 95th percentile in t-statistics, there is no algorithm to provide an automatic threshold selection strategy to systematically identify candidate clusters. Clusters that experience true bone loss between conditions A and B should have higher values of T(k, l). Thus, the optimum clusters derived from the current data also should have the largest mean difference value Δ for future astronauts in the same bone region.

We therefore propose the use of cross-validation methods to choose the optimum suprathreshold u * c  ∈ T. The basic idea of cross-validation (CV) is to randomly split a data set D (of total size I) into K mutually exclusive subsets D 1, D 2, ··· D K of approximately equal size. The clusters based on a threshold u c are then formed using K − 1 subsets by excluding D i (denoted as \( D\backslash D_{i} \)). We can test the effect of these newly formed clusters on the excluded subset D i that was not used to construct the clusters. Repeating the procedure K times, with each subset used exactly once for validation, constitutes a K-fold CV [23]. A tenfold CV [2325] is often considered sufficient. When K equals I, the number of observations in the original sample D, the procedure is known as leave-one-out cross-validation (LOOCV). Here, the full dataset D is the set of all images from all astronauts pre- and post-flight.

To expedite the search for the optimal suprathreshold, we first define a search range for u c . Let u L and u H be the low and high bounds for u c , respectively. We begin with an initial threshold of u 1 = u L and follow by iteration at u n  = u n−1 + Δu until we reach u H . Here Δu is an acceptable tolerance for error in the optimal u. In our example, we defined the 80th–99th percentiles from the original distribution of T as the search range and used a half percentile for Δu.

Proposed precedure for u * c

Our procedure for selecting the optimal suprathreshold u * c is as follows (Shown in Fig. 1):

Fig. 1
figure 1

Graphical illustration of the cross-validation procedure

  • Step 0: Create the t-map T 0 for the full dataset D.

  • Step 1: Partition the data D into K mutually exclusive subsets D i , i = 1, 2, …K.

  • Step 2: Leave out subset D i and use \( D\backslash D_{i} \) to create the t-map T i . For the current u n , define candidate clusters as all clusters C j with t-statistics above u n in T i . Calculate mean difference for all pairs of pixels in the clusters or \( {\text{ROIs}}\left( {i,u_{n} } \right) \) for subset D i :

    $$ {\text{ROIs}}\left( {i,u_{n} } \right){ = }\left\{ {C_{j} :\forall (k,l) \in C_{j} ,T_{ - i} (k,l) \ge u_{n} } \right\} $$
    (3.1)
    $$ m_{i} \left( {u_{n} } \right) = {\frac{{\sum\limits_{{\left( {k,l} \right) \in {\text{ROIs}}\left( {i,u_{n} } \right)}} {\Updelta_{i} (k,l)} }}{{\left| {{\text{ROIs}}\left( {i,u_{n} } \right)} \right|}}} $$
    (3.2)

    where \( \Updelta_{i} (k,l)\) is defined in (2.1) and \( \left| {{\text{ROIs}}\left( {i,u_{n} } \right)} \right| \) is the number of pixels in \( {\text{ROIs}}\left( {i,u_{n} } \right) \).

  • Step 3: Repeat Steps 2 to 4 until the Kth sample has been excluded.

  • Step 4: Get the summary cross-validation (CV) statistics for the K models as objective function:

    $$ {\text{CV}}(u_{n} ) = \frac{1}{K}\sum\limits_{i = 1}^{K} {m_{i} \left( {u_{n} } \right)} $$
    (3.3)

    Take u 0 to be the 50th percentile of T 0 as a baseline, normalize CV(u n ) to accommodate for differences in scale between images using CV(u 0) as:

    $$ {\text{NCV}}(u_{n} ) = {\frac{{{\text{CV}}(u_{n} ) - {\text{CV}}(u_{0} )}}{{{\text{CV}}(u_{0} )}}} $$
    (3.4)
  • Step 5: Repeat Steps 2 to 7 over all candidate clusters (ROIs) and derive

    $$ u^{\prime} = \mathop {\arg \max }\limits_{{u_{n} }} {\text{NCV}}(u_{n} ) $$
    (3.5)

    and finally choose

    $$ u_{c}^{*} = \min \left\{ {u:{\text{NCV}}(u) \ge {\text{NCV}}(u^{\prime}) - {\text{SE}}\left( {{\text{NCV}}(u^{\prime})} \right)} \right\} $$
    (3.6)

    where \( {\text{SE}}\left( {{\text{NCV}}(u^{\prime})} \right) \) is the standard error (SE) of NCV(u ) K-fold CV samples.

Here, the statistics m i (u n ) are used for clusters with single or multiple pixels. In this application we use a mean difference instead of a mean of the t-statistics because we will use LOOCV, and the pooled standard deviation of D i is not available. With sufficient numbers of subjects in the CV subsets (>2), t-statistics for each pixel can be calculated, and the mean of t-statistics can be used to replace Δ i (kl) in (3.2). This percentage improvement measure in (3.6) is unitless and less dependent on different constructions of CV subsets. Equation (3.6) is the 1-SE rule originally recommended by Brieman et al. for CVs [25] and adopted by many authors in evaluations of CV errors [26, 27], in particular in recursive partitioning analysis. It recognizes that candidate thresholds within 1-SE range from the optimal u in (3.5) most likely will result in comparable NCV (u) ‘s to the optimal NCV (u′). By lowering the suprathreshold to 1-SE in (3.6), we will get slightly larger size clusters with more stable feature statistics, yet not sacrifice the efficiency to measure changes. Past experience and simulation studies suggested that the 1-SE rule can screen out noise in finite sample CVs [26, 28].

Once we identify the optimal suprathreshold, the remaining challenge is to determine which clusters represent significant change beyond chance. Traditional permutation tests [5, 6] could be used to derive the permutation distribution, thereby eliminating the need to assume a specific distribution for the test statistics [10, 21]. Consider two conditions pre- and post-flight as A and B, and data from I subjects follows as ABABAB···. Then rearrange the labels randomly within subjects to get another sequence maybe as BA AB BA ···, And a new t-map T r could be derived for each time (r = 1, 2, …, R).

Cluster size is the most sensitive to choice of thresholds, which is a simple statistics counting the number of connected pixels (in cluster) above the threshold u,

$$ Y_{j}^{S} \left( u \right) = \sum\limits_{{\left( {k,l} \right) \in C_{j} \left( u \right)}} {1_{T(k,l) \ge u} } , $$
(3.7)

where 1T(k,l)≥u is a binary indicator for pixels with t-statistics above u. For each permutation time r, find the cluster above u * c with maximum size Z r in T r . For each cluster C j (u * c ) in T 0, calculate how many Z r is larger or equal to the size of cluster C j and get the empirical p value as \( p_{j} = {{\# \left( {Z_{r} \ge Y_{0,j} } \right)} \mathord{\left/ {\vphantom {{\# \left( {Z_{r} \ge Y_{0,j} } \right)} R}} \right. \kern-\nulldelimiterspace} R} \). If p j  ≤ α (e.g., α = 0.05), we conclude that cluster C j (u * c ) displays statistically significant differences (in the feature statistics) between conditions A and B.

Application to the study of bone loss during long-duration spaceflight

As described in Sect. 2, scans of the hip were taken for 16 astronauts before and after their 4–6 months spaceflight. After co-registration of these images [19], paired t-statistics were calculated for each pixel to form an observed t-map T 0 in Fig. 2. This set of t-statistics had the empirical distribution T shown in Fig. 3a.

Fig. 2
figure 2

Original T 0 and 28 candidate clusters by CV. Twenty-eight (28) candidate clusters were circled with red color for intensity in t statistics. The top five clusters are numbered in the figure

Fig. 3
figure 3

Histogram of t-statistics T and NCV-threshold curve. Red and green dotted lines correspond to the 90th and 95th percentile t values, respectively, of T0

Because of the small sample size (I = 16 subjects), we used LOOCV (described in Sect. 3) to determine the optimal suprathreshold u * c , and a search range of 80th-99th percentiles and a tolerance level Δu of 0.5%. The maximum NCV(u ) was 0.52, achieved at the suprathreshold u of 3.41, which corresponds to the 93rd percentile of the distribution of T. The standard error (SE) of NCV(3.41) was 0.1 (Fig. 3b), which led to the optimum superathreshold u * c as 3.14, or the 90th percentile of T, according to Eq. (3.6).

With this optimal u * c , 28 clusters were formed within the original t-map T 0, as represented by red regions in Fig. 2. Five major clusters ranged in size from 9 to 55 pixels. The remaining 23 smaller clusters had mean t-statistics 3.45 (SD 0.17) and mean size of 3.7 pixels. A permutation test based on 1,000 permutations was then used to yield p values for cluster size Y S j (j = 1, …, 28) that are shown in Table 1. Under control of type I error with α = 0.05, we identified five clusters based on Y S j as sites with statistically significant bone loss. The remaining clusters were not statistically significant.

Table 1 Cluster size and p values for five major clusters and remaining smaller clusters

By comparison, using the conventional 95th percentile threshold for T produced 22 clusters. The p values for Y S j of these clusters are also presented in Table 1. Compared with the 95th percentile of T, the optimal suprathreshold u * c (90th percentile of T) produced the same significant results for cluster size for each cluster.

Discussion and conclusion

In this paper, we propose statistical improvements to the suprathreshold cluster analysis (STCA) framework for longitudinal image comparisons. While STCA has been used in neurological imaging research, particularly in functional brain imaging, its application to other imaging areas is less common. We hope our study of bone loss in astronauts during long-term spaceflight will support the general application of this statistical tool, and our extensions of it, to diverse biological systems.

As an alternative to STCA, Statistical Parametric Mapping (SPM) is a more commonly used method to compare longitudinal changes in images and identify clusters or ROIs. SPM assumes a Gaussian Random Field (GRF) and common variance structures across subjects, which are difficult to verify [21]. The main advantage of STCA is that it is a non-parametric method that does not require special assumptions about spatial or intra-subject longitudinal and biological correlation structures. Permutation tests have been widely used for high dimensional data, especially in genomics [29] and functional neurological image analysis [10, 30], and can be applied not only to longitudinal changes in individuals, but also to (cross-sectional or longitudinal) comparisons of groups by permuting group assignments.

The main contribution of this paper was the use of cross-validation (CV) to select the optimum suprathreshold. CV methods are more independent of the type of input data and special image analysis than other methods that rely on untenable or unverifiable assumptions about statistical distributions. Comparing with traditional 95th percentile method, the clusters detected by CV trend to be with bigger size.

We performed a simulation study to demonstrate improved statistical power and efficiency of this method, which added artificial clusters with known intensity and size changes into an assigned region of the image. Compared with the conventional 95th percentile threshold, the clusters identified by CV tend to be larger in size. Especially for low intensity, the 95th percentile threshold sometimes divided one cluster into two or more sub-clusters, which decreased the homogeneity within clusters and resulted in insignificant changes. Results are not shown but are available from the authors.

In this application, we wanted to identify the specific location(s) of greatest bone loss within the hip during long-term spaceflight, and therefore used a one-sided change of bone loss as our CV metric. For more general longitudinal applications to a null hypothesis of no change and the alternative hypothesis of any change in either direction (e.g. gain or loss of bone), we can use the absolute t-map as the CV metric.

While the statistical methods of this paper can be used for either 2-D or 3-D images, our demonstration is confined to the 2-D (pixel-based) case. Extension to 3-D (voxel-based) images and other types of digital images, e.g. satellite remote sensing, photography, or astronomy, should be straightforward.

Astronauts incur bone loss during long-duration spaceflight, and it is reasonable to expect that the majority of bone loss occurs in areas that are subject to greatest mechanical stress under earth’s gravity. Understanding the spatial heterogeneity in loss of proximal femoral bone tissue, in which the largest losses concentrate in the load-bearing subregions, is of interest to general mammalian biology as well as for the well-being of astronauts after their return to gravity. Some research has shown that bone adapts to earth’s gravity by increasing the size of cortical bone but not necessarily trabecular bone [18]. Knowing the nature of most significant bone loss will help devise preventive measures during spaceflight as well as rehabilitation interventions post-spaceflight.

In summary, this paper proposed a cross-validation (CV) method to select the optimum suprathreshold and form candidate pixel clusters (or ROIs) of longitudinal changes of images and provided one method solve the problem for a fixed “primary” threshold in STCA.