Continuous and discrete-time survival prediction with neural networks

Due to rapid developments in machine learning, and in particular neural networks, a number of new methods for time-to-event predictions have been developed in the last few years. As neural networks are parametric models, it is more straightforward to integrate parametric survival models in the neural network framework than the popular semi-parametric Cox model. In particular, discrete-time survival models, which are fully parametric, are interesting candidates to extend with neural networks. The likelihood for discrete-time survival data may be parameterized by the probability mass function (PMF) or by the discrete hazard rate, and both of these formulations have been used to develop neural network-based methods for time-to-event predictions. In this paper, we review and compare these approaches. More importantly, we show how the discrete-time methods may be adopted as approximations for continuous-time data. To this end, we introduce two discretization schemes, corresponding to equidistant times or equidistant marginal survival probabilities, and two ways of interpolating the discrete-time predictions, corresponding to piecewise constant density functions or piecewise constant hazard rates. Through simulations and study of real-world data, the methods based on the hazard rate parametrization are found to perform slightly better than the methods that use the PMF parametrization. Inspired by these investigations, we also propose a continuous-time method by assuming that the continuous-time hazard rate is piecewise constant. The method, named PC-Hazard, is found to be highly competitive with the aforementioned methods in addition to other methods for survival prediction found in the literature.


Introduction
Survival analysis considers the problem of modeling the distribution of the time to an event. A plethora of statistical methods for analyzing time-to-event data, and especially right-censored survival data, have been developed over the last fifty years or so. Most of these methods, like Cox regression, assume continuous-time models, but methods based on discrete-time models are sometimes used as well. Statistical methods for continuous-time survival data are discussed in a number of textbooks, such as Klein and Moeschberger (2003). The literature on discrete-time models and methods is more limited, but the recent book by Tutz and Schmid (2016) provides a nice presentation.
An important part of survival analysis is the topic of time-to-event prediction, also denoted survival prediction. This generally concerns predicting when an event will occur for new individuals (not part of our training set), where each individual is defined by a vector of covariates. Time-to-event prediction is most commonly approached by predicting the survival function for each individual, meaning we provide an estimate of the event time distribution conditioned on each individual's covariates. In standard survival analysis, Cox regression is often used for this purpose (Klein and Moeschberger 2003, Chapter 8.6).
As a result of the rapid development in machine learning, and in particular neural networks, a number of new methods for time-to-event predictions have been developed in the last few years. This development has benefited from the excellent frameworks for neural network development, such as TensorFlow, PyTorch, Theano, Keras, and CNTK, which have simplified the application of neural networks to existing likelihoodbased methodologies. Thus, novel methods for time-to-event predictions have been developed based on Cox's partial likelihood (e.g., Katzman et al. 2018;Luck et al. 2017;Yousefi et al. 2017;Kvamme et al. 2019) and the likelihood for discrete-time survival data (e.g., Lee et al. 2018;Fotso 2018;Gensheimer and Narasimhan 2019).
In survival analysis, continuous-time models are arguably more commonly applied than their discrete-time counterparts. However, as neural networks are parametric models, fully parametric models are typically more straightforward to integrate with the neural network frameworks than semi-and non-parametric models. This makes discrete-time survival models, which are fully parametric, interesting candidates to extend with neural networks when developing methods for time-to-event predictions.
To the best of our knowledge, Lee et al. (2018) were the first to apply modern neural networks to the discrete-time likelihood for right-censored data. Their prediction method, denoted DeepHit, parameterizes the probability mass function (PMF) of the survival distribution with a neural network and combines the log-likelihood for right-censored data with a ranking loss for improved discriminative performance. In statistical survival analysis, it is, however, more common to express the likelihood by the hazard rate. Gensheimer and Narasimhan (2019) used this form of the likelihood and parameterized the discrete-time hazard rate with a neural network. They showed that their prediction method performs well, both in terms of discrimination and calibration of the survival predictions. However, they did not compare their methodology, which they refer to as Nnet-survival, with methods that parameterize the PMF.
One aim of the present paper is to perform a systematic study of the use of neural nets in conjunction with discrete-time likelihoods for right-censored time-to-event data. In particular, we perform a systematic comparison of methods that parameterize the PMF and the discrete hazard rate.
More importantly, we show how methods that are developed for discrete-time survival data using neural networks, may be adopted as approximations for continuoustime data. In this way, we circumvent the problem that continuous-time survival models are not so easily adapted to the neural net frameworks. To this end, we have to perform a discretization of the continuous time scale; a subject that has received little attention in the literature. Immediately, it might seem reasonable to have a very fine-grained discretization scheme, allowing the approximate discretized event times to be very close to the true event times. However, the number of parameters in the neural networks typically increases with finer discretization schemes, possibly making the network overfit. Therefore, there is a balance to be considered between the number of parameters in the network and the approximation error introduced by the discretization. We consider two discretization schemes, corresponding to equidistant times or equidistant survival probabilities, and conduct a simulation study to better understand the effect of the discretization scheme and the number of time points used for the discrete-time methods.
Closely related to the discretization of a continuous time scale is the subject of interpolation. A coarse discretization grid has the benefit of reducing the number of parameters in a neural network. But the approximation error that incurs when a discrete-time method is used as an approximation for continuous-time data, becomes smaller with a finer discretization grid. By interpolating the discrete-time survival predictions, one may use a coarser discretization grid with less of an impact on the approximation error of the predictions. For this reason, two interpolation schemes are investigated in this paper. The first assumes constant density functions between the time points in the discretization grid, and the second assumes constant hazard rates between the grid points. As a modification of the latter method, we also propose a continuous-time method obtained by assuming that the continuous-time hazard rate is piecewise constant, and we compare this method with the aforementioned discretetime methods with and without interpolation.
The paper is organized as follows. First, in Sect. 2, we consider the discrete-time likelihood for right-censored event times and discuss how the likelihood may be parameterized with neural networks. Then, in Sect. 3, continuous-time models for time-to-event data are considered, and we discuss how discretization of the continuous time scale enables the use of discrete-time survival methods for continuous-time data. Here we also present the two schemes for interpolating discrete survival functions, and we consider our continuous-time method obtained by assuming piecewise constant hazards. In Sect. 4, a simulation study is conducted to understand the impact the discretization and interpolation schemes have on the methods, and in Sect. 5, we compare the methods with existing methods for time-to-event predictions using five real-world data sets. Finally, we summarize and discuss our findings in Sect. 6. Some additional material on the simulations and the implementation of the methods are provided in the Appendix. The code for all methods, data sets, and simulations presented in this paper are available at https://github.com/havakv/pycox.

Discrete-time models
In this section, we will restrict ourselves to models in discrete time. Then, in Sect. 3, we will discuss how discrete-time models may be used as approximations of models in continuous time.

The discrete-time survival likelihood
Consider an individual described by its covariate vector x ∈ R q . Assume that time is discrete with values 0 = τ 0 < τ 1 < . . ., and let T = {τ 1 , τ 2 , . . . } denote the set of positive τ j 's. The time of an event is denoted T * ∈ T, and our goal is to model the conditional distribution of this event time given the covariate vector x. The probability mass function (PMF) and the survival function for the event time are defined as (1) In survival analysis, models are often expressed in terms of the hazard rate rather than the PMF. For discrete time, the hazard rate is defined as and it follows that Note further that from (3) it follows that the survival function can be expressed as In most studies, we do not observe all event times. For some individuals, we only have a right-censored observation. To allow for censoring, we let C * ∈ T C = {τ 1 , τ 2 , . . . , τ m } be a right-censoring time. Here τ m defines the maximum follow-up time, at which all individuals still at risk are administratively censored. The random variables T * and C * are typically not observed directly, but instead we observe a potentially right-censored event time T = min{T * , C * } and an event indicator D = 1{T * ≤ C * }. We here follow the common convention in survival analysis that when an event and censoring time coincide, we observe the occurrence of the event. Note that, as C * ≤ τ m , we are not able to observe event times T * larger than τ m . Hence, we are restricted to model the distribution of the event times in T C .
We assume that T * and C * are conditionally independent given x, and that their distributions have no parameters in common. Then we can consider, separately, the contribution to the likelihood of the event time distribution and the censoring distribution. We are, however, typically only interested in modeling the event time distribution. Now, considering a set of n independent individuals, each with covariates x i , event or censoring time t i , and event indicator d i , the likelihood contribution of each individual i is given by Using this, we can fit models by minimizing the mean negative log-likelihood A useful reformulation of the loss function (6) is obtained by rewriting it in terms of the discrete hazards. To this end, let κ(t) ∈ {0, . . . , m} define the index of the discrete time t, meaning t = τ κ(t) . Using (2), (3), and (4), we can then rewrite the likelihood contribution (5) as With this formulation, the mean negative log-likelihood in (6) can be rewritten as Here, y i j = 1{t i = τ j , d i = 1}, so y i = (y i1 , . . . , y iκ(t i ) ) is a vector of zeros with a single 1 at the event index κ(t i ) when t i corresponds to an observed event (d i = 1). We recognize this as the negative log-likelihood for Bernoulli data, or binary crossentropy, a useful discovery first noted by Brown (1975). With the two loss functions (6) and (7), we can now make survival models by parameterizing the PMF or the discrete hazard rate and minimizing the corresponding loss. For classical statistical models, these approaches are equivalent and have been used to obtain maximum likelihood estimates for the parameters in the PMF/hazard rate; see Tutz and Schmid (2016) for a review. We will, however, not consider classical maximum likelihood estimates, but focus on the part of the literature that fits neural networks for the purpose of time-to-event prediction, in which case the two loss functions may give different results.

Parameterization with neural networks
A neural network φ(x) ∈ R m is a parametric, differentiable function of a covariate vector x ∈ R q that minimizes a loss function using a gradient descent approach. While networks typically contain thousands or millions of parameters, simple models such as linear and logistic regression can also be considered neural networks. For a large number of parameters, we are usually not interested in the parameter estimates themselves, but only in the network's predictive capabilities. While there is a vast literature on various ways to parameterize neural networks, the internal structure of the networks is not that relevant for this paper as we only consider the most standard multilayer perceptron networks, or MLP's. So, for the purposes of this paper, we think of the network φ(x) ∈ R m as some very flexible parametric function of the covariates x. For more on MLP's and neural networks in general see, e.g., the book by Goodfellow et al. (2016).
In the previous subsection, we saw that the survival likelihood can be expressed in terms of the PMF or the hazard rate. In the following, we will describe how to use this to create survival prediction methods by parameterizing the PMF or hazard with neural networks. In theory, as both approaches aim at minimizing the same negative log-likelihood, the methods should yield the same results. But due to the nature of neural networks, this might not be the case in practice. Contrary to most parametric statistical models, neural networks are typically overparameterized and a minimum is not obtained for the training loss. Instead, a held-out validation set is monitored, and the iterative optimization procedure is stopped when performance on this validation set starts to deteriorate. Also, considering that neural networks are well known to be sensitive to numerical instabilities, some parameterizations of a likelihood might result in better performance than others.
First, considering the hazard parametrization of the likelihood, let φ(x) ∈ R m represent a neural network that takes the covariates x as input and gives m outputs. Each As the discrete hazards are (conditional) probabilities, we apply the logistic function (sigmoid function) to the output of the network to ensure that h(τ j | x) ∈ (0, 1). We can estimate the hazard rate by minimizing the loss (7), and survival estimates can be obtained from (4). To the best of our knowledge, this method was first proposed by Gensheimer and Narasimhan (2019). However, if one considers the special case where φ j (x) = β T x, the approach is well known in the survival literature and seems to have been first addressed by Cox (1972) and Brown (1975); see also Allison (1982). The book by Tutz and Schmid (2016) gives a review of the approach. The implementation we use in the experiments in Sects. 4 and 5 differs slightly from that of Gensheimer and Narasimhan (2019), as it was found to be numerically more stable (see Appendix B). In this paper, we will refer to the method as Logistic-Hazard, as coined by Brown (1975), but one can also find the term Logistic Discrete Hazard used in the statistical literature. Gensheimer and Narasimhan (2019) referred to it as Nnet-survival, but we will refrain from using that name as we find Logistic-Hazard to be more descriptive.
We can obtain a survival model by parameterizing the PMF in a similar manner to the Logistic-Hazard method. As for the hazards, the PMF f (τ j | x) represents probabilities, but, contrary to the conditional probabilities that define the hazard, we now require the PMF to sum to 1. As we only observe event times in T C , we fulfill this requirement indirectly through the probability of surviving past τ m . Thus we have Now, again with φ(x) ∈ R m denoting a neural network, the PMF can be expressed as Note that (9) is equivalent to the softmax function (also used in multinomial logistic regression) with a fixed φ m+1 (x) = 0. Alternatively, one could let φ m+1 (x) vary freely, something that is quite common in machine learning, but we chose to follow the typical conventions in statistics. By combining (1) and (8), we can express the survival function as for j = 1, . . . , m − 1, and Notice the similarities to classification with m + 1 classes, as we are essentially classifying whether the event is happening at either time τ 1 , . . . , τ m or later than τ m . However, due to censoring, the likelihood is not the cross-entropy. Instead, by inserting (9) and (10) into (6), we get the mean negative log-likelihood where κ(t i ) still denotes the index of individual i's event or censoring time, that is, . This is essentially the same negative log-likelihood as presented by Lee et al. (2018). Note, however, that contrary to the work by Lee et al. (2018) the negative log-likelihood in (11) allows for survival past time τ m . Some numerical improvements of the implementation are addressed in Appendix B. We will refer to this method simply by PMF as this term is unambiguously discrete, contrary to the term hazard which is used both for discrete and continuous time.
As a side note, the Multi-task logistic regression (Yu et al. 2011), and the neural network extension of this method (Fotso 2018), can be shown to be a PMF model by considering a cumulative sum of the linear predictor, or in the neural network case, the cumulative sum of the output of the network. Details are given in Appendix C.

Continuous-time models
In the following, we no longer consider the time scale to be discrete, but instead consider continuous-time models, where T * , C * > 0, and we let T = min{T * , C * } and D = 1{T * ≤ C * } be as before. Let τ denote the maximum possible value of C * , so that P(C * ≤ τ ) = 1. Hence, a potentially right-censored observation T is in the interval (0, τ]. Instead of a PMF, we now have the density function f (t | x) and the continuous-time survival function Furthermore, the continuous-time hazard rate is a non-negative function of the time (no longer restricted to [0, 1]), As a result, we can express the survival function in terms of the cumulative hazard This yields the continuous-time version of the likelihood contribution in (5), In what follows, we will first discuss how we can apply the discrete-time methods from Sect. 2.2 for continuous-time data. We will here address how time can be discretized to fit the discrete-time model formulation, and how to interpolate an estimated discrete survival function for continuous-time predictions. Then, we will propose a new continuous-time method by assuming that the hazard in (12) is piecewise constant.

Discretization of the time scale
Both the PMF and Logistic-Hazard methods require time to be discrete on the form 0 = τ 0 < τ 1 < · · · < τ m . Hence, to apply the methods to continuous-time data, we need to perform some form of discretization of the time scale. Possibly the most obvious way to discretize time is to make an equidistant grid in [0, τ ] with m grid points. An alternative, that we explore in this paper, is to make a grid based on the distribution of the event times. By disregarding covariates and estimating the marginal survival function S(t) with the Kaplan-Meier estimator, we obtain a general trend of event times. WithŜ(t) denoting the Kaplan-Meier survival estimates, we can make a grid from the quantiles of the estimates, 1 =Ŝ(0) = ζ 0 > ζ 1 > · · · > ζ m =Ŝ(τ ). We will assume that each interval has the same decrease in the survival estimate, so that ζ j − ζ j+1 = (1 −Ŝ(τ ))/m. The corresponding time grid, τ 1 < · · · < τ m , is obtained by letting τ j be the smallest value of t such thatŜ(t) ≤ ζ j . We will then obtain a more dense grid in intervals with more events, and a less dense grid in intervals with fewer events. This is illustrated in Fig. 1, where we can see that the grid becomes coarser as the slope of the survival curve becomes less steep.
The discrete-time methods assume that all events and censorings occur at the τ j 's, so, when performing the discretization, we move all event times in an interval to the end of that interval while censored times are moved to the end of the previous interval. This means that for Our reason for this choice is that this is typically how the times are recorded. Consider a study where we are only able to make observations at times τ 1 < τ 2 < · · · < τ m . For a censored observation, τ j−1 is the last point in time where the individual was recorded alive, while for an observed event, τ j is the first point in time for which the individual was recorded with the event.

Interpolation for continuous-time predictions
When discrete-time survival methods are applied to continuous-time data, the survival estimates become a step function with steps at the grid points; see the blue curve in Fig. 2. Consequently, for coarser grids, it might be beneficial to interpolate the discrete survival estimates. To this end, we propose two simple interpolation schemes that fulfill the monotonicity requirement of the survival function. The first assumes that the probability density function is constant in each time interval (τ j−1 , τ j ], while the second scheme assumes constant hazard in each time interval. We will refer to the schemes as constant density interpolation (CDI) and constant hazard interpolation (CHI). Note that the two interpolation schemes correspond to piecewise linear and piecewise exponential survival estimates, as illustrated in Fig. 2.

A piecewise constant continuous-time hazard parametrization
We now propose a continuous-time method by parameterizing the hazards in (14). As for the constant hazard interpolation, we will let the continuous-time hazard be piecewise constant. Disregarding neural networks, this model was first proposed by Holford (1976), and further developed by Friedman (1982) who found that piecewise constant hazards yield a likelihood proportional to that of a Poisson likelihood.
Consider a partition of the time scale 0 = τ 0 < τ 1 < · · · < τ m = τ , and let κ(t) denote the interval index of time t such that t ∈ (τ κ(t)−1 , τ κ(t) ] (this is slightly different from the discrete case where we had t = τ κ(t) ). If we assume that the hazard is constant within each interval, we can express the hazard as the step function h(t | x) = η κ(t) (x) for a set of non-negative functions {η 1 (x), . . . , η m (x)}. For τ j = τ j − τ j−1 , we can now express the cumulative conditional hazard as (14) yields the likelihood contribution for individual i,

Inserting this into
What remains is to parameterize the hazard with a neural network. However, to avoid passing all the τ j 's to the loss function, we let the network instead parameterize the functionsη j (x) = η j (x) τ j . This allows us to rewrite the likelihood contribution as is the proportion of interval κ(t) at time t.
As before, let φ(x) ∈ R m denote a neural network. To ensure thatη j (x) is nonnegative, we could have usedη j (x) = exp[φ j (x)]. However, for better numerical stability we prefer to use the softplus functioñ Now, again considering n independent individuals, each with covariates x i , observed event or censoring time t i , and event indicator d i , our model can be fitted by minimizing the mean negative log-likelihood and estimates for the survival function can be obtained by where ρ(t) is given by (15). We will refer to this method as the piecewise constant hazard method, or PC-Hazard. Even though this is a continuous-time method, we still need to decide the set of τ j 's that define the intervals. Therefore, the discretization techniques discussed in Sect. 3.1 are also relevant for this method.

Simulations
To get a better understanding of the methodologies discussed in Sects. 2 and 3, we perform a simulation study where we vary the size of the training sets, the discretization scheme, and the number of grid points used for discretization. Gensheimer and Narasimhan (2019) performed a similar study to evaluate the effect of discretization on their Logistic-Hazard method with the conclusion that there were no differences in performance. However, their simulations were quite simple (only one binary covariate), their only performance metric was the Harrell Jr et al. (1982) concordance at 1-year survival, and they did not include any interpolation of the survival estimates.
For this reason, we find that further investigations are warranted. We generate simulated survival times by sequentially sampling from discrete-time hazards defined on a fine grid of time points. The hazards are specified through their logit transforms, as this enables us to use functions in R while still obtaining hazards in (0, 1). The logit hazards, where , for i = 1, 2, 3.
Each of the three functions g sin (t | x), g con (t | x), and g acc (t | x) are constructed to give a specific contribution to the hazards: g con (t | x) gives a constant hazard for a set of covariates, g acc (t | x) allows for a hazard that increases with time, and g sin (t | x) enables periodic patterns in the hazards. With this combination, we are able to represent a variety of event time distributions. Each function γ k (x) uses five of the covariates in x, totaling a covariate vector x of size 45. The functions γ k (x) are defined in Appendix A.
We let the discrete time scale consist of 1,000 equidistant points between 0 and 100 so that τ 0 = 0, τ 1 = 0.1, τ 2 = 0.2, …, τ 1000 = 100. Knowing the hazards, the true survival function can be obtained with (4). In Fig. 3 we show five examples of logit hazard rates and their corresponding survival functions. Note that even though we simulate our data using a discrete-time model, the time-grid is so fine that this mimics simulation from a continuous-time model. The full details of this simulation study are given in Appendix A.

Simulation setup, hyperparameter tuning, and evaluation
We created three training sets of size 3000, 10,000, and 50,000, a validation set of size 10,000 (for hyperparameter tuning), and a test set of size 100,000. For the training and validation sets, we included a censoring distribution with constant hazard resulting in 37% censoring. The full uncensored test set is used for evaluation. For the discretization of the time scale, we applied both the equidistant scheme and the Kaplan-Meier quantiles, each with 5, 25, 100, and 250 grid points. The three prediction methods under investigation, described in Sects. 2.2 and 3.3, all consist of the same "base" neural network φ(x) ∈ R m , but have individual output layers and functions transforming φ(x) to survival estimates. These final transforms do, however, not include any parameters or hyperparameters. For a fair comparison, we therefore perform the same hyperparameter search of φ(x) for all models. As an extensive search is unfeasible for such models, we only consider MLP networks with ReLU activation functions, batch normalization, and dropout between each layer. The size of the networks is controlled by the number of layers and the number of nodes in each layer, which we assume to be the same for all layers. Our choices are motivated by an attempt to represent the most standard MLP's. For more on MLP's and neural networks in general see, e.g., the book by Goodfellow et al. (2016).
We performed a hyperparameter grid search over 1, 2, 4, and 8 hidden layers; 16, 64, and 256 nodes; and dropout of 0 (no dropout) and 0.5. Each net was trained with a batch size of 256 and the AdamWR optimizer (Loshchilov and Hutter 2019) with cycle length 1, where, at each restart, the cycle length was doubled and the learning rate was multiplied by 0.8. Learning rates were found using the methods proposed by Smith (2017). The methods' respective loss functions (negative log-likelihoods), computed on the validation set, were used for selecting the best set of hyperparameters. The hyperparameter tuning was repeated 10 times, giving 10 fitted models for each combination of method, grid size, discretization scheme, and training set size. In general, the discretization scheme, both granularity and discretization method, are hyperparameters. In the simulation experiments, however, we are interested in comparing the performance across discretization schemes, and to this end we obtain a model for each scheme. But when we in Sect. 5 compare performance on real data, we need to include the discretization scheme in the hyperparameter tuning.
For evaluating the predictive performance of the methods, we consider two metrics on the uncensored held-out test set. The first metric is the average mean squared error (MSE) between the survival estimates and the true survival function at all 1000 time points τ 1 , . . . , τ 1000 MSE = 1 100, 000 HereŜ(τ j | x i ) and S(τ j | x i ) are the estimated and true survival functions, respectively, for individual i (in the test set) at time τ j . So, in this regard, the discrete-time survival estimates are represented by step functions, as illustrated in Fig. 2. Note that the MSE (19) is only applicable for simulated data, as the true survival functions S(τ j | x i ) are not known in real-world applications. The second metric is the time-dependent concordance (Antolini et al. 2005), which evaluates a method's ability to correctly rank individuals' survival estimates according to their event times. This is achieved by estimating the probability of correctly ranking two arbitrary individuals where T i and T j are the known potentially right-censored event times of individuals i and j. In other words, if individual i experiences the event of interest while individual j is still at risk, a good set of predictions should give a lower survival estimate for individual i than individual j at the event time of individual i. Note that, contrary to the MSE, larger values for the concordance are considered better.

Comparison of discrete-time methods
We start by comparing the two discrete methods from Sect. 2.2, that parameterize the PMF and the discrete-time hazards. We refer to them as PMF and Logistic-Hazard, respectively.
In Fig. 4 we plot the median test scores of the two methods versus the grid size used for discretization. The number above each plot gives the size of the training set used to fit the methods. The full lines represent equidistant grids, while the dotted lines are from grids obtained with quantiles from Kaplan-Meier survival estimates. We have also included the constant hazard interpolation (CHI) of the survival estimates from the Logistic-Hazard method (see Sect. 3.2).
For the smallest training set of size 3000 we see that the best performance (smallest MSE and highest concordance) is obtained with a grid of size 25 and that the finer grids of size 100 and 250 result in much worse performance (higher MSE and lower concordance). For the two larger training sets, the finer grids have generally performance on par with or better than that of grid of size 25. This is reasonable as coarser grids require fewer parameters in the neural networks, and the networks with very fine grids are therefore more likely to overfit the data. Nevertheless, the coarsest grid of size 5 seems to only work well for the interpolated estimates, and does very poorly for the discrete estimates. The discretization grids from Kaplan-Meier quantiles seem to give slightly better scores than the equidistant grids for the Logistic-Hazard with the coarsest grids; in particular for the smaller training sets. This difference is, however, quite small. Comparing the discrete survival estimates from Logistic-Hazard (blue lines) with the CHI estimates (orange lines), we see that the two lines overlap for finer grids. This is expected as the effect of interpolation decreases as the grids become finer.
In general, the PMF method does not perform as well as the Logistic-Hazard.

Comparison of interpolation schemes
In the following, we compare the interpolation schemes for the discrete-time hazard method Logistic-Hazard. The experiments are not shown for the PMF method as the results are very similar.
In Sect. 3.2 we presented two methods for interpolation of discrete survival estimates. The first assumes constant density in each interval (denoted CDI for constant density interpolation), while the second assumes constant hazard in each interval (denoted CHI for constant hazard interpolation). In our simulation study, we have four grid sizes and two discretization schemes. As the hyperparameter tuning was repeated 10 times this gives 80 fitted models for each method on each data set. In Fig. 5, we plot the scores of these 80 models sorted from best to worst, as this both tells us the best performance, in addition to the stability of the methods. The figure Clearly, there is almost no difference in performance between the two interpolation schemes, while the discrete estimates have slightly worse best-case performance and much worse worst-case performance. So the interpolation primarily helps with stability in performance across discretization schemes, but similar performance can be obtained for discrete predictions given careful hyperparameter tuning.
As the two interpolation schemes perform the same, we will in the further simulations only include the CHI estimates as they and the continuous-time PC-Hazard method both assume constant hazard rate, simplifying the comparison between the methods.

Comparison with PC-Hazard
Finally, we compare the previous methods with our proposed continuous-time hazard method from Sect. 3.3, PC-Hazard. In Fig. 6 we plot the MSE and concordance for the interpolated Logistic-Hazard (CHI) method and the continuous-time PC-Hazard method. Note that the y-axis is compressed compared to the previous figures. First, we notice that PC-Hazard does better for the coarsest grid with only five grid points, while Logistic-Hazard (CHI) typically performs best with 25 grid points. In general, the differences in performance between the two models are very small. We note, however, that the best performing model for the smallest training set is the PC-Hazard with a Kaplan-Meier grid of size 5, both in terms of MSE and concordance. Finally, we again see that the Kaplan-Meier quantiles seem to give slightly better performance than the equidistant discretization when the grids are coarse.  Fig. 7 in Appendix A, we have included a plot of the same type as Fig. 5 for the Logistic-Hazard (CHI) method, the Logistic-Hazard method, the PMF method, and the PC-Hazard method. The figure again shows that the PMF method performs slightly worse than the other methods, while the PC-Hazard method performs similarly to the Logistic-Hazard (CHI) estimates.

Summary of simulations
To summarize the results of the simulations, we have shown that the size of the discretization grid (number of τ j 's) has a large impact on the performance of the methods, and therefore needs to be carefully tuned. Finer grids enable the methods to reduce bias in the predictions but require more parameters in the neural networks (higher variance). By defining the discretization grid with Kaplan-Meier quantiles, the performance for the coarser grids typically improves, while it has no apparent effect for finer grids.
Interpolation of the discrete-time survival estimates alleviates some of the sensitivity to the discretization scheme. For the coarser grids, interpolation was generally found to improve performance, while it does affect the performance for finer grids. The performance of the two proposed interpolation schemes, CHI and CDI, was more or less indistinguishable.
Comparing the three methods, we found that PMF did not perform as well as the Logistic-Hazard, both in terms of best-case performance and stability to discretizationgrid configurations. PC-Hazard was found to be competitive with the interpolated Logistic-Hazard method and even performed better for the smallest training set. But the differences between all methods were small, and the size of the training sets and the grid size were shown to have a much larger impact on the performance than the choice of method.

Experiments with real data
We now compare the methods discussed in this paper to other methods in the literature, in particular DeepHit ( No alterations were made to the NWTCO data set. The size, the number of covariates, and the proportion of censored individuals in each data set are given in Table 1.

Hyperparameter tuning and evaluation
The experiments were conducted by five-fold cross-validation. For each split, a separate hyperparameter search was conducted, meaning that for each method we may end up with separate hyperparameters for each split.
For hyperparameter tuning, we set aside 20% of the data in each split as a validation set (corresponding to 16% of the full data set). For each method considered, we used the tuning criteria proposed in the original papers for choosing the best hyperparameters.
For the methods presented in this paper, we cannot use the validation loss for the hyperparameter search, as done in the simulations in Sect. 4.1. This is because we now need to include the discretization scheme as part of the hyperparameter search, and the losses (negative log-likelihoods) are dependent on the granularity of the discretization scheme (dependent on the number of output nodes). Instead, we will use the integrated Brier score (IBS) by Graf et al. (1999) computed over 100 equidistant points between the minimum and maximum observed times in the validation set. The IBS considers both discrimination and calibration of the survival estimates, and accounts for censored individuals by weighting the score by the inverse of the estimated censoring distribution.
For evaluation, we will consider the IBS in addition to the time-dependent concordance (Antolini et al. 2005). In contrast to the IBS, the concordance only evaluate the discriminative capabilities of a method's predictions. It is interesting to study both metrics as there might be a trade-off between well-calibrated estimates and good discriminative performance. For these real data sets, we do not know the true survival function and we can therefore not use the MSE (19) reported in the simulation studies.
The experiments were conducted using the same hyperparameter search and training strategy as presented in Section 6.1 of the paper by Kvamme et al. (2019), but we decrease the learning rate by 0.8 at the start of each cycle, as this was found to give more stable training. For the methods requiring discretization of the time scale, the hyperparameter search considered grid sizes of 5, 25, 50, 100, and 200, both with equidistant spacing and by using the Kaplan-Meier quantiles. The best parameter configuration for each method in each of the five cross-validation splits was fitted 10 times, and we calculated the median concordance and integrated Brier score (IBS) of the 10 repetitions and averaged these over the five folds.

Results
The results are presented in Tables 2 and 3. In terms of concordance, we see that DeepHit and PC-Hazard perform very well. The three Logistic-Hazard methods and Cox-Time all perform close to PC-Hazard, while the PMF, RSF and the other Cox methods perform slightly worse. The concordances of the two proposed interpolation schemes, CHI and CDI, are very similar, but the CDI method tends to give slightly better scores. There does, however, not seem to be much performance gain in interpolation for the concordance.
Examining the IBS in Table 3 (smaller is better) we again find that PC-Hazard performs very well. But now, DeepHit does quite poorly. This is expected as DeepHit is designed for discrimination rather than well-calibrated estimates (see Kvamme et al. 2019). In general, the PMF, the RSF, and the three proportional Cox methods seem to have slightly higher IBS than the Hazard methods, but again the differences are quite small. Cox-Time performs quite well on all data sets except for FLCHAIN and NWTCO. Comparing the interpolation schemes of Logistic-Hazard, it seems that CDI still performs slightly better than CHI, although both are quite close to the discrete estimates of Logistic-Hazard.
In summary, all three methods discussed in this paper are competitive with existing survival methodology. However, the interpolated Logistic-Hazard and the PC-Hazard seem to give the most stable high performance considering both discrimination and calibration.

Table 2
Concordance from 5-fold cross-validation on real-world data sets In this paper, we have explored survival methodology built on neural networks for discrete-time data, and how it can be applied for continuous-time prediction. We have compared two existing discrete-time survival methods that minimize the negative log-likelihood of right-censored event times, where the first method (Lee et al. 2018) parameterize the event time probability mass function (PMF), while the second method (Gensheimer and Narasimhan 2019) parameterize the discrete hazard rate (Logistic-Hazard). Through empirical studies of simulated and real data sets, we found that the Logistic-Hazard method typically performs better than the PMF parametrization, both in terms of discrimination and calibration of the survival predictions.
We proposed two interpolation schemes for the discrete methods. Both schemes were found to improve predictions for methods with a coarse discretization of the time scale. In particular, as coarser discretization reduces the number of network parameters, the interpolation schemes gave the largest improvements when applied to smaller data sets. The two interpolation schemes were found to perform very similarly.
We also proposed a new continuous-time method that assumes constant hazard in predefined time intervals (PC-Hazard). The method was found to perform very well compared to existing methods, both in terms of discrimination and calibration. Furthermore, in a simulation study, we found that the method continued to perform better for coarser discretization grids than the interpolated Logistic-Hazard method. This was particularly beneficial for the smallest training set in the simulation study.
All three methods investigated in this paper need some form of discretization or coarsening of the time scale. In that regard, we proposed a simple scheme that uses the quantiles of the event time distribution estimated by Kaplan-Meier, and showed through simulations that the quantile-based grids typically outperformed equidistant grids for coarser grids.
In summary, we found that all three methods perform quite similarly, and the choice of discretization has a larger impact on performance than the choice of method. For the discrete methods, interpolation of the survival predictions will typically make the performance less sensitive to the discretization scheme.
Some interesting further development of our work would be to extend the models to allow for left-truncated event times and competing risks. While these are common topics in survival analysis, the literature on neural network extensions is quite limited.
Finally, we have only considered time-to-event prediction given the information that is available at the outset of a study. One could also be interested in predicting the remaining time until an event given observation of time-dependent covariates and that the event has not yet happened by a time t 0 . The landmarking approach of van Houwelingen and Putter (2011) offers one possible framework for such dynamic predictions.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
where z is the floor operation. We drawx j iid ∼ Unif[−1, 1], and β k iid ∼ N (0, 1). The forms of the γ i (x)'s have been chosen to obtain reasonable survival functions. In particular, γ 2 (x) ensures that the number of periods is a multiple of 2, as we wanted to introduce a more restricted seasonality pattern in hazard than any arbitrary period.
Hence, we sample u j,k iid ∼ Unif[−1, 1] for k = 1, . . . , m − 1, and set u j,k =x j − k i=1 x j,i β j,i , giving the covariates Using this scheme, it is straightforward to change the number of covariates without affecting the hazards. The code for generating these simulations is available at https:// github.com/havakv/pycox. where γ = max j (z j ), to ensure that we only take the exponential of non-positive numbers. Hence, by rewriting the loss (11) in terms of φ j (x), with φ m+1 (x) = 0 and γ i = max j (φ j (x i )), we obtain For the discrete hazard parametrization, we note that the loss in (7) is written as a sum over binary cross-entropy terms. This allows us to rely on existing implementations of binary cross-entropy to ensure numerical stability. In practice, these implementations use the log-sum-exp trick on the φ j (x) (the logits).
Finally, for the continuous hazard parametrization, we use existing implementations of the softplus function which uses a linear function over a certain threshold, meaning log(1 + exp[z]) ≈ z for large values of z. However, we also note that for z ≈ 0, we have that log(1 + z) ≈ z. Hence, for φ κ(t i ) (x i ) 0 we use that

Appendix C Multi-task logistic regression as PMF
Multi-task logistic regression, proposed by Yu et al. (2011), provides a generalization of the binomial log-likelihood to jointly model the sequence of binary labels Y j = 1{T * ≤ τ j }. This means that Y = (y 1 , . . . , y m ) is a sequence with zeros for every time τ j up to the event time, followed by one's. Then .
(C.1) Yu et al. (2011) only consider linear predictors ψ k (x) = x T β k , but this was extended to a neural network by Fotso (2018). The parameters of ψ k (x) are found by minimizing the negative log-likelihood in (6).
As f (τ j | x) = P(Y = (y 1 , . . . , y m ) | x), where y k = 1{k ≥ j}, the expression in (C.1) can be written as where φ j (x) = m k= j ψ k (x). Hence, the multi-task logistic regression is equivalent to the PMF in (9), but where φ j (x) is the (reverse) cumulative sum of the output of the network ψ(x) ∈ R m . To the extent of our knowledge, there are no benefits to this extra cumulative sum. Instead, it simply requires unnecessary computations, and, for large m, it can cause numerical instabilities.