Background

In basic science and translational studies, time-to-event outcomes are commonly measured imprecisely or incompletely due to limitations in the design or the data collection schema. For example, in mouse studies, due to practical and cost considerations, tumor progression events are not monitored in real time. Each event is recorded as occurring between two contiguous assessments. In clinical studies, drug-induced adverse events are commonly not reported by the date of occurrence but rather by drug cycle. These are examples of grouped failure time data, also referred to as grouped survival data, grouped censored data, or simply grouped data.

It should be noted that grouped failure time phenotypes are not restricted to studies with time-to-event outcomes. Cell-line cytotoxicty studies (e.g., growth-inhibitory effects of tamoxifen on human breast cancer cell lines [1]) are commonly designed on the basis of a pre-specified set of doses. Events, such as the half-maximal inhibitory concentration (IC50), are not directly observable due to the discrete nature of the dose design. The event of interest is observed to occur between two consecutive doses. In the subsequent report, we do not distinguish between time-to-event data and dose-to-event data, and refer to them as failure time data or survival time data.

One approach for dealing with grouped data is to use methods designed to adjust for tied continuous survival times in the presence of right censoring. Commonly used methods are those proposed by Breslow and Peto [2] or Efron [3]. As we will demonstrate in this paper, these methods may be biased when applied to grouped survival data. Another approach is to use an exact likelihood [4]. However, to do so is computationally intensive and thus not feasible for genome-wide inference.

In this paper, we present groupedSurv, an open source R [5] package implementing a statistically rigorous and computationally efficient approach proposed in Prentice and Gloeckler [6] for genome-wide analysis with group censored phenotypes. The package conducts analyses at a variant, gene or pathway level. Through applying groupedSurv to simulated survival data, we establish that our approach controls type I error and yields unbiased effect size estimates. We also show, by simulation, that when methods designed for tied survival outcomes are applied to grouped survival data, the results may be biased. On the basis of results from computational cost benchmarking studies on both real and simulated survival data, we demonstrate that our package completes analyses of large genome-wide association studies (GWAS), e.g., 1,000,000 variants and n=1000 samples, within three minutes on a four-core computer. To illustrate the application of our package, we conduct an analysis of a previously published GWAS to identify common variants associated with taxane-induced peripheral neuropathy in breast cancer patients [7].

Implementation

Statistical considerations

Statistical model

Under the standard Cox proportional hazards model [8], the conditional hazard function at time t>0, given variables xi and zi, is canonically presented as

$$ \lambda(t|\mathbf{x}_{i},\mathbf{z}_{i})=\lambda_{0}(t)\exp\left(\mathbf{x}_{i}^{T} \mathbf{\beta} + \mathbf{z}_{i}^{T} \theta\right), $$
(1)

where λ0(t) is an unspecified baseline hazard function, xi is a vector of variables of interest, and zi is a vector of baseline covariates. Here, β is the parameter vector of interest while θ is a nuisance parameter vector. Under the grouped failure time model, the event of interest will fall into one of r pre-specified time intervals, or be right-censored at the beginning of one of these intervals, denoted [tj−1,tj), for j∈{1,2,…,r}, where t0=0 and tr=. The groupedSurv package employs the regression model for grouped survival data proposed by Prentice and Gloeckler [6], which discretizes the conditional hazard function (1) according to these intervals. Additional technical details for the model are provided in the “Statistical model” section of Additional file 1.

Efficient score statistic

For a single variable of interest, xi, groupedSurv tests the hypothesis H0:β=0 against H1:β≠0, under the grouped failure time model. The test is conducted using an efficient score statistic [9] on the basis of the observed data, {y1=(k1,δ1,x1,z1),…,yn=(kn,δn,xn,zn)}. Here, for sample i∈{1,…,n}, ki∈{1,2,…,r} denotes the interval of the observed grouped outcome and δi∈{0,1} denotes the event indicator. In this notation, ki=j,δi=1 indicates that the event occured in interval [tj−1,tj) while ki=j,δi=0 indicates that the sample was right-censored at time tj−1. Using this notation, the corresponding likelihood function is

$$ L_{i}(\mathbf{y}_{i},\beta,\eta) = \left(1 - \alpha_{k_i}^{\exp(x_{i} \beta + \mathbf{z}_{i}^{T} \theta)} \right)^{\delta_i} \prod_{j=1}^{k_{i}-1} \alpha_{j}^{\exp\left(x_{i} \beta + \mathbf{z}_{i}^{T} \theta\right)}, $$
(2)

where \(\alpha _{j} = \exp \left (-\int _{t_{j-1}}^{t_{j}} \lambda (u)\,du\right)\) and η=(α,θ).

Our inferential approach differs from that described in Prentice and Gloeckler in that we use a partitioned score statistic. More specifically, rather than testing the null hypothesis that all of the effect parameters are zero (H0:β=0 & θ=0), we are testing only the effect of xi (H0:β=0), while treating the parameters of the baseline covariates, θ (as well as the αj) as nuisance parameters. Using the efficient score function for β in the above model, we define our efficient score statistic as

$$ \mathcal W = \frac{\left(\sum\limits_{i=1}^{n} S_{{\beta}}({0,\hat{{\eta}}})\right)^{2}} {n\left(\bar{\mathcal I}_{{\beta\beta}}({0,\hat{{\eta}}})- \bar{\mathcal I}_{{\beta\eta}}({0,\hat{{\eta}}})\bar{\mathcal I}_{{\eta\eta}}({0,\hat{{\eta}}})^{-1} \bar{\mathcal I}_{{\beta\eta}}({0,\hat{{\eta}}})^{T}\right)}, $$
(3)

where \(\hat {\eta }\) is the maximum likelihood estimate (MLE) of the nuisance parameter under the null hypothesis, \( S_{{\beta }}({0,\hat {{\eta }}})\) is the score function with respect to β, \(\bar {\mathcal I}_{{\beta \eta }}({0,\hat {{\eta }}})\) is the second derivative of the score statistic with respect to β and η (similarly for \(\bar {\mathcal I}_{{\beta \beta }}({0,\hat {{\eta }}})\) and \(\bar {\mathcal I}_{{\eta \eta }}({0,\hat {{\eta }}})\)), and n is the sample size. The asymptotic null distribution for the statistic is chi-square, with degrees of freedom equal to the dimension of β. The derivation of this equation and other technical details, including our approach for approximating the standard error of the score statistic, are provided in Additional file 1.

Gene- and pathway-level statistics

Within the context of GWAS, what is often of interest is to conduct the analysis at the level of a gene or pathway rather than an individual variant. To allow for flexibility in conducting these types of set-based analyses, groupedSurv(), the primary function in the groupedSurv package, optionally returns the contribution of each sample to the score statistic for each variant tested. This enables users to employ the set-based statistic of their choice. An additional function, geneStat(), accepts a user-specified function as an argument and computes the gene- or pathway-level statistics directly. If no function is specified, geneStat() implements a sequence kernel association test (SKAT) [10, 11] statistic by default. Technical details are provided in the “Gene- and pathway-level statistics” section of Additional file 1.

Multiple testing

In addition to the unadjusted marginal asymptotic P-values for the variables tested, groupedSurv returns family-wise error rate (FWER) adjusted P-values and local false-discovery rates (FDR). The FWER-adjusted P-values are calculated based on the Bonferroni correction [12], while the local FDRs are based on Storey’s Q-values [13, 14].

Software package

Design

The computational algorithms of groupedSurv are coded in C++ [15], and the Rcpp [16] package is used to interface with the R environment. The Rcpp package provides a series of R wrapper classes to import and load C++ code, and allows passing of R objects between R and C++. The pthreads model is used for implementing multi-threading. In addition to the help files for its functions, a vignette is provided as a user tutorial.

Input data format

The package is designed to accept multiple input data formats, including standard R data frames and matrix objects, and gwaa.data objects from the GenABEL package [17]. Data can also be imported from binary PLINK [18] files using the BEDMatrix package [19]. Genotype dosage data, obtained from imputation software (e.g., MACH [20, 21] or IMPUTE [22]) in the form of VCF [23] files, can be imported using the VariantAnnotation package [24]. The package vignette provides examples for importing each of the data formats. Note that, in the case of missing genotypes or genotype dosage data the package implements a complete-case analysis of each variant.

Usage

The primary function of the package, groupedSurv(), is capable of executing multiple analyses in parallel. It returns a data frame containing the efficient score statistics, along with the (unadjusted) asymptotic P-values, FWER-adjusted P-values, and the FDR for each of the variables of interest.

The thetaEst() function provides MLEs for the nuisance parameters, i.e., the baseline survival rates for each interval, \(\hat {{\alpha }}_{j}\), and the parameters for any covariates, \(\hat {{\theta }}\). The estimates are computed under the null hypothesis, i.e., that the variable of interest has no effect on the time to event. This allows these estimates to be reused to calculate the efficient score statistic for any number of variables being tested.

Evaluation

Survival data simulation

We evaluate the accuracy and computational costs of the implemented methods using simulations. Grouped survival data are simulated by first generating continuous survival times, and then translating these into grouped survival times. First, the variables of interest, X, and two covariates, Z=(Z1,Z2), are simulated. Conditional on the realized values X=x and Z=(z1,z2), the continuous survival time, T, is drawn from an exponential distribution with hazard rate exp(βx+θ1z1+θ2z2). The censoring time, C, is drawn from a uniform distribution over (0,cmax).

To transform continuous times into grouped times, we first specify a final observation time point τ. The interval [0,) is then divided into r contiguous intervals, composed of r−1 finite intervals of equal size spanning [0,τ), and a final rth interval, [τ,). The resulting right-end points of the finite intervals represent the study observation time points. We define T and C as the right-end points of the intervals containing T and C, respectively. Grouped survival time is then defined as \(\tilde {T}=\min (T^{\ast },C^{\ast })\), and the event indicator is given by \(\mathbbm {1}\{T^{\ast }<C^{\ast }~\&~\tilde {T} \le \tau \}\). Technical details for the simulation approach are provided in the “Simulation” section of Additional file 1.

Statistical operating characteristics

We assess the statistical operating characteristics of our approach on the basis of empirical type I error control, statistical power, and bias of the effect size estimates. We also compare these characteristics to those of the Cox proportional hazards model [8], using either the Efron or exact likelihood methods to adjust for tied survival times in the presence of right censoring. We use the implementations of these approaches provided by the survival [25] R extension package.

Benchmarking

We conduct additional simulation studies to assess the computational performance of groupedSurv. To this end, we consider a range of sample sizes and variable counts. We also assess the performance gains from increasing the number of CPU cores used for parallel processing. Ten thousand replicate simulations are conducted for each scenario. The benchmarking analyses are performed on a AMD OpteronTM 6180 SE Server CPU running the Debian Stretch (9.3) AMD64 GNU/Linux. A detailed description of the parameters of each simulation are provided in the “Simulation” section of Additional file 1.

Reproducible pipeline

The knitr [26] R extension package is used to reproducibly conduct the data simulation, summarize the operating characteristics, and estimate the processing benchmarks. The scripts to reproduce the simulation and operating characteristics are provided as Additional file 2.

Data analysis

CALGB 40101 clinical and GWAS data

CALGB 40101 is a randomized phase III study comparing the efficacy of two standard adjuvant therapy regimens in women with breast cancer. It employs a two-by-two factorial design, randomizing patients to paclitaxel versus doxorubicin and cyclophosphamide, and four versus six cycles of therapy. Blood samples from patients who consent to participate in a pharmacogenomic companion study (CALGB 60202) and provide usable DNA are genotyped on the Illumina 610 quad platform. A genome-wide analysis to identify single nucleotide polymorphisms (SNPs) associated with paclitaxel-induced peripheral neuropathy is reported by Baldwin et al. [7]. The authors employ a cumulative dose to first grade 2 or higher paclitaxel-induced peripheral neuropathy event as the phenotype and conduct the analysis using the Cox score statistic for right-censored outcomes. The Efron approximation is used to deal with any ties in the cumulative dose outcome.

The clinical and genomic data are available for download from the database of Genotypes and Phenotypes (dbGaP) through accession phs000807.v1.p1. Additional details about the clinical study are provided in Shulman et al. [27]. Additional details about the GWAS data and neuropathy phenotyping are provided in Baldwin et al. [7].

Analysis of the CALGB 40101 data using groupedSurv

The study population for our analysis consists of 859 genetically estimated European patients identified in the CALGB 40101 GWAS publication. Four patients with no outcome data are uninformatively censored at time 0, effectively excluding them from the analyses. An addition 11 patients were missing baseline covariates or paclitaxel dosing information, and were also excluded.

After limiting the population to genetically estimated European patients with complete data, SNPs with call rates < 95%, Hardy-Weinberg P-values < 10−8, or relative minor allele frequency (MAF) < 0.05 are removed. Analyses are limited to autosomal SNPs only. The filtering is conducted using the GenABEL [17] package. The final genome-wide analysis is conducted across 500,897 SNPs.

Study model

SNPs are tested, under the additive genetic model, for association with cumulative dose to paclitaxel-induced peripheral neuropathy. Cumulative dose is measured as the number of cycles of paclitaxel received (1,2,3,4,5 or 6) prior to neuropathy event or treatment termination. For the analyses, body surface area (BSA) at clinical baseline and age at registration (log base 10 transformed) are used as covariates. GWAS is conducted both with groupedSurv and the coxph() function from the survival [25] package (using the Efron method for ties), and the top hits, ranked by unadjusted P-values from groupedSurv, are compared. The scripts to reproduce the analyses are provided as Additional file 3.

Manhattan and quantile-quantile (QQ) plots are used to visualize the empirical distribution of the resulting unadjusted asymptotic P-values. To illustrate the effect size, we use the non-parametric MLEs of the survival function [28] for interval-censored outcomes provided by the icfit() function from the interval [29] package. For regional visualizations, LocusZoom [30] plots are generated for the selected SNPs or genes.

Results

Statistical operating characteristics

We establish the statistical operating characteristics for our approach by empirical assessment of the type I error and the bias of the effect size for the simulated variant and the nuisance parameters. The type I error simulation results are shown in Fig. 1 for MAFs of 0.05,0.2 and 0.5, and event rates of 0.3, 0.5 and 0.7. Each example is based on a sample size of n=1000 and B=10,000 simulation replicates. These results provide confirmation that our approach provides type I error control.

Fig. 1
figure 1

Box Plot for Type 1 Error Box plot of type 1 error of groupedSurv for different event rates and minor allele frequencies

The simulation results for the empirical bias assessments are shown in Fig. 2 for MAFs of 0.05,0.2 and 0.5. An event rate of 0.6 and a sample size of n=1000 is used for each example, under both the null (effect size β=0) and alternative (effect size β=1) hypotheses. Along with the results from our approach, we provide the corresponding results using the Efron and exact methods of adjusting for ties in a standard right-censored analysis. Each example consists of B=10,000 replicates. Our approach produces evidently unbiased estimates, regardless of MAF. The other two approaches seemingly underestimate the effect size for MAFs of 0.05 and 0.2, while producing unbiased estimates only in the most statistically powerful case when the MAF is 0.5 [31].

Fig. 2
figure 2

Effect Size Estimation Bias for groupedSurv and Coxph Box plot comparing bias of effect size estimation of groupedSurv and Coxph

Having established control of type I error, the simulation results for the power estimations for the grouped failure time method are shown in Fig. 3 for MAFs of 0.05, 0.1, 0.2 and 0.5. An event rate of 0.6 and a sample size of n=1000 is used for each example, with the effect size varying over the range of β∈(−0.9,0.9). Power is estimated at the two-sided α=0.05 level using B=10,000 replicates.

Fig. 3
figure 3

Power for groupedSurv Power estimates for groupedSurv for different minor allele frequencies

These assessments of the statistical operating characteristics, bias, type I error, and power are also repeated for a smaller sample size (n=500). The results are included in Additional file 1.

Computational performance

To assess the computational performance of groupedSurv in terms of parallel efficiency, we consider a sample size of n=1,000, with 100,000 simulated SNPs to be tested individually as the variables of interest. The timing results based on 1, 4, 8, 12 and 16 CPU cores are shown in panel (a) of Fig. 4. The squares represent the observed CPU time using multiple cores while the circles represent the ideal performance, calculated as CPU time using one core divided by the number of CPU cores in each scenario.

Fig. 4
figure 4

Timing benchmark plots for groupedSurv for a different numbers of CPU cores, and b different sample sizes and numbers of SNPs

To evaluate the performance of groupedSurv in terms of sample size and number of variables tested, we consider sample sizes of n=100, 500 and 1000, and SNP counts of 200,000, 500,000 and 1,000,000. The models also include two baseline covariates. The computational costs for the different sample sizes and numbers of SNPs are shown in panel (b) of Fig. 4, based on four CPU cores.

CALGB 40101 GWAS results

The analysis is based on outcome data from 844 CALGB 40101 patients with 500,897 SNPs passing quality controls. An annotated list of the top 300 SNPs, ranked by unadjusted P-value, is provided as Additional file 4. The QQ and Manhattan plots for all SNPs are shown in Fig. 5. The QQ plot does not exhibit evidence of inflation.

Fig. 5
figure 5

QQ and Manhattan Plot for GWAS(a) Quantile-quantile plot of expected − log10(P-values) and observed − log10(P-values) for all SNOs analyzed, (b)Manhattan plot of − log10(P-values) for all SNPs analyzed

None of the 500,897 variants meet the genome-wide threshold of significance of 1 × 10−8. We prioritize the variants according to their corresponding unadjusted P-values for further examination with respect to potential biological relevance. The top 50 hits are shown in Table 1. Figure 6 shows the non-parametric MLEs of neuropathy-free survival by genotype for the top three hits. Survival and regional visualization plots are provided for six additional SNPs in Additional file 1.

Fig. 6
figure 6

Non-parametric Maximum Likelihood Survival Survival plots for the top three SNPs

Table 1 Top hits from CALGB 40101

Discussion

On the basis of results from the application of the groupedSurv package to simulated survival data, we demonstrate the theoretical properties of our approach. More specifically, we confirm that the testing rule provides proper type I error control, and that the model estimates are asympotically unbiased.

The groupedSurv package enables fast genome-wide inference based on grouped censored phenotypes. As Fig. 4 illustrates, on a single core, the analysis of 1,000,000 variants and n=1000 patients requires less than 8.5 minutes. The completion times are reduced to about 2.1, 1.0, 0.7 or 0.5 minutes when increasing the core count to 4, 8, 12 or 16, respectively.

As we have outlined, an alternative approach to using a likelihood model for grouped data is to repurpose methods for adjustments for ties in continuous right censored data. As we and others [31] have shown, these methods may be biased if applied to grouped censored data. The effect of sample size on this bias is illustrated in Figure S1 in Additional file 1.

Although the analysis using this approach did not identify any variants that reached the genome-wide threshold for significance, the annotated results (Table 1) revealed additional candidate genes undiscovered in the initial GWAS. Among the genes listed in Table 1, many show functions in biological pathways critical to neurite regeneration after chemotherapy damage and show modest to high expression in human dorsal root ganglion, the target peripheral nerve damaged by chemotherapy agents [32]. In particular, the genomic region containing rs9435869 is an intriguing hit since both neighboring genes have potential relevance to chemotherapy-induced peripheral neuropathy (CIPN). RHOU is a Rho GTPase that is known to regulate cytoskeletal organization and induce filopodium formation, a critical step in neuronal development [3335]. Furthermore, RAB4A, encoding a member of the RAB GTPase family, is a regulator of vesicular recycling of cell surface receptors through interaction with its effector NDRG1 [36]. Mutations in NDRG1 (N-myc downstream-regulated gene 1) have been shown to be causative in the rare congenital subtype of Charcot-Marie Tooth Disease Type 4D, an inherited peripheral sensorimotor nerve disorder [3739].

Another SNP included in the top 50 hits, rs6481837, annotated to NRP1 (neuropilin 1) which functions to control growth cone projection in developing neurons [40], supports the hypothesis that axon guidance is important in the development of CIPN. This gene pathway was previously implicated with the role of EPHA5 from the initial GWAS [27]. Genes involved in synaptic formation associated to neurodevelopmental disorders of the central nervous system, such as NRXN1 (neurexin 1) and CADM1 (cell adhesion molecule 1), may also have biological relevance in the context of the peripheral nervous system during the manifestation of this drug-induced neurotoxicity [41, 42]. Although these genes/variants are promising for understanding the molecular mechanisms of CIPN, further in vitro investigations are needed.

Here we note several limitations in our approach. First, while the simulation results confirm that the MLEs are asymptotically unbiased, the effect size estimates are model-based and thus may be biased if the model is misspecified. Second, because the phenotype for the CALGB 40101 GWAS analysis is constructed via a manual review of clinical research forms [7], the cycle attribution may be erroneous due to incomplete or missing information. Our approach properly accounts for the group censoring mechanism, however, it may not be able to account for such phenotyping errors. Lastly, in the analysis of the CALGB 40101 data, we assume that the censoring mechanism is uninformative. Although the censoring induced by assignment to four versus six cycles can safely be assumed to be uninformative as it was decided by study randomization, other forms of censoring may be informative, such as early dropouts due to drug sensitivity. If informative censoring is present, the model parameters need to be interpreted within the context of cause-specific hazards.

We conclude this section by considering potential future extensions. In its current form, our approach does not allow for time-dependent covariates. For example, BSA at baseline is a covariate in our analysis of the CALGB 40101 GWAS. BSA, however, varies at each cycle. Prentice and Gloeckner [6] note that it is possible to extend the regression model to incorporate time-dependent covariates. Another extension, that is currently partially implemented, is the ability to analyze data from familial genetic studies, in which the outcomes may be correlated. The package incorporates a kinship frailty model [43], however, the current implementation is sensitive to departures from model assumptions. Therefore, at present we consider it as an experimental feature. A future extension of this implementation is to make it more robust.

Conclusions

Grouped censored phenotypes are prevalent in basic, translational and clinical science research due to the design of, or limitations in, the data collection schema. Failure to properly account for a grouped failure time mechanism may lead to biased analysis results. We present groupedSurv, an open source R package for conducting genome-wide analyses based on grouped survival phenotypes in a statistically principled and rigorous, and computationally efficient manner. In the context of GWAS, the package enables analysis at the variant as well as gene or pathway level. The package is extensively documented and freely available under a public license to the research community.

Availability and requirements

Project name: groupedSurv

Project home page: https://CRAN.R-project.org/package=groupedSurv

Operating system(s): Linux, Windows and OS X

Programming language: R

Other requirements:C++ 11 or higher, and supporting R packages required for installation

License: GPL (≥2)

Any restrictions to use by non-academics: None