1 Introduction

Human blood is a key component of many medical treatments and plays a crucial role in all health care systems. It is needed to save lives in acute emergencies, to allow for many types of surgical interventions, such as organ transplants, and it is continuously required for the survival of chronic patients. Unfortunately, blood cannot be produced in a laboratory but can only be withdrawn from healthy subjects, and its short shelf life limits the period between withdrawal and use. Therefore, blood is a limited resource, while its demand is very high. For example, before COVID-19, the demand was about 10 million units per year in the US and 2.1 million in Italy (World Health Organization 2012), and these values are growing again. In Western countries, blood is usually collected from volunteer donors, i.e., unpaid individuals who donate their blood voluntarily and for free. Blood is classified by group and by Rhesus factor (Rh), and patients receive blood of their own type (combination of group and Rh factor) or of a compatible one. There are two types of donations, whole-blood and apheresis; in apheresis, a mechanical apparatus separates specific blood constituents, such as erythrocytes or platelets, and reinfuses the others into the donor.

Blood is supplied by the Blood Donation (BD) system, which is tasked with providing an adequate supply of blood units to meet the demand of transfusion centers and hospitals, while respecting their storage capacity and the temporal profile of the demand. The BD Supply Chain (BDSC) can be divided into four echelons (Sundaram and Santhanam 2011): collection, transportation, storage and utilization. Many problems arise in the BDSC management from the collection echelon to the final utilization of blood units, which have been widely addressed in the literature (Beliën and Forcé 2012; Baş et al. 2016). In this paper, we focus on the collection echelon, which is very relevant to the entire BDSC because problems at this stage may deteriorate the performance of the entire chain and lead to blood shortages and wasted units (Baş Güre et al. 2018). For example, increasing the number of donations clearly improves the performance of the BD system, but an effective management of the available donations, which directs donors to suitable days, can also help avoid shortage and wastage of blood units.

The management of a blood collection center must take into account a twofold perspective (Baş Güre et al. 2018; Baş et al. 2018). On the one hand, it should pursue operational goals common to several health care providers, such as waiting time reduction, optimal workforce planning, and effective appointment scheduling. On the other hand, a blood collection center produces blood units and blood products to meet storage and utilization demands. Two aspects are particularly critical from the production viewpoint. Firstly, the number of produced units and all activities carried out at any blood collection center strongly depend on the number of donors who arrive daily at the center to make a donation. Predicting the daily number of donors in advance is therefore essential for proper planning and sizing of the collection center, and to provide reliable information on the future production of blood units to the rest of the BDSC echelons. Secondly, blood collection centers invest, also in economic terms, to carry out campaigns to promote and acquire further donors. The goal is to enroll novel donors who regularly and frequently donate blood. Therefore, it is important to identify the most productive donor profiles to better target these promotion campaigns and recruit donors who will guarantee a high production of blood units over time. Both needs can be addressed through a statistical prediction model for the intensity function of the donation event. For the profiling goal, the profiles of donors who donate more frequently can be identified by analyzing donors’ characteristics that significantly yield shorter waiting times before the next donation. For the prediction goal, the number of donors arriving on each day of a given horizon can be obtained by combining the predictions of all donors who can donate within that horizon.

This work considers whole-blood donations, which cover most of the donation events. We propose a Bayesian model for the intensity function of the blood donation event, which describes it as a function of individual donor’s random frailties and their time-dependent covariates. The aim is to explain donors’ behaviors since their first donation, based on their individual characteristics. Under the Bayesian approach, the parameters of the likelihood (the conditional distribution of individuals in the sample) are random, and all the statistical inference is based on the posterior distribution of these random parameters, namely, the conditional distribution of the parameters given the observed data. The posterior distribution of all parameters also allows us to predict donors’ arrivals, thus supporting planning and other management tasks, and to identify the parameters that influence the intensity of the donation, thus supporting profiling. To show an application of our model, we apply it to the data provided by the Milan department of the Associazione Volontari Italiani del Sangue (AVIS), referred to as AVIS Milan in the following. AVIS is the largest Italian blood donor organization, founded in Milan in 1927. Today, it ensures about 80% of the national blood supply, and is present throughout the country with over 3400 centers. AVIS Milan, one of the most important nodes in the AVIS network, collects blood from donors residing or working in Milan. It collects about 1500 whole-blood donations per month and supplies them to the Niguarda hospital, located in the city. The dataset includes the list of donations from each donor, along with the measurements of some donors’ vital parameters (e.g., heart rate, blood pressure and hemoglobin) and information on donors’ habits acquired before each donation through an interview with a physician.

The remainder of this paper is organized as follows. Section 2 overviews the literature related to the problem addressed in this work. Section 3 describes the variables and provides an exploratory analysis of the available dataset. Section 4 details our model, while Sect. 5 shows the posterior inference. Then, Sect. 6 exploits the model outcomes and predictions to support profiling decisions and management tasks. Finally, Sect. 7 concludes the work.

2 Literature review

The BDSC has been extensively studied, as documented in Osorio et al. (2015) and Baş Güre et al. (2018). However, these literature reviews show that most studies focus on the storage and utilization echelons of the BDSC. In contrast, the collection echelon has been only marginally studied compared to the others (Ayer et al. 2018, 2019; Baş et al. 2018).

The work of Baş et al. (2018) first contributed to the development of a decision support tool for blood collection, proposing an appointment scheduling system that includes a linear programming model for preallocating time slots to blood types and a prioritization policy to assign a slot when a donor makes the reservation. This framework was then extended in Doneda et al. (2023) to include home blood donations and in Yalçındağ et al. (2020) to face uncertain donor arrivals. In fact, uncertainty is recognized as a major sticking point when dealing with blood donation management (Lanzarone and Yalçındağ 2019), as well as in several health care management contexts (Addis et al. 2015). Therefore, it must be considered to adequately address the problems arising in blood donation management, and to create more effective tools, solution methods and decision support systems for the BDSC. In the following, to show its impact, we first overview recent works that deal with uncertainty in the management of blood collection. Then, we analyze recent works that propose stochastic models to predict the uncertain quantities affecting blood donation and to classify donors.

From the management viewpoint, Jabbarzadeh et al. (2014) developed a robust optimization model for blood facility location and allocation decisions during post-disaster periods under supply and demand uncertainty. Zahiri et al. (2015) adopted a robust possibilistic fuzzy programming approach to determine the best locations of blood facilities coping with several uncertain parameters. Ramezanian and Behboodi (2017) developed a robust optimization approach for the location-allocation problem of blood collection centers in the presence of stochastic demands. Rabbani et al. (2017) analyzed the mobile blood collection system for platelet production with uncertain donors’ arrivals. Hamdan and Diabat (2019) proposed a stochastic model for red blood cell supply that simultaneously considers production, inventory and location decisions. Finally, as mentioned, Yalçındağ et al. (2020) included uncertain donors’ arrivals in the BD appointment scheduling.

Other works specifically focused on prediction and classification tasks in BD (Khalid et al. 2013). Darwiche et al. (2010) combined a principal component analysis and a support vector machine to predict blood donation occurrences, and applied this combined approach to donor data from a blood transfusion service in Taiwan. Santhanam and Sundaram (2010) and Sundaram and Santhanam (2011) used decision trees to classify donors, in order to determine voluntary blood donorship based on blood donation patterns. Ramachandran et al. (2011) classified blood donors using a decision tree to identify regular donors and enable blood banks to organize blood donation camps efficiently. Similarly, Boonyanusith and Jittamai (2012) used neural networks and decision trees, to identify patterns in blood donors’ behaviors based on the factors influencing the donation decision. In a perspective more similar to that of our work, Testik et al. (2012) adopted a two-step cluster method, together with classification and regression trees, to identify donors’ daily and hourly arrival patterns, considering data from a Turkish hospital. Khalilinezhad et al. (2014) used association rule mining to find the best donors within the whole population, and applied their approach to data from two cities in the Middle East. More recently, Alkahtani and Jilani (2019) adopted a classification approach to predict returning donors and time series analysis to predict donation dates, focusing on the lower number of returning donors versus the higher number of non-returning ones. Bischoff et al. (2019) used time series forecasting to predict the daily number of donations to a tertiary care center, to account for a decrease in platelets production preemptively. Shashikala et al. (2019) applied naive Bayes technique and K-nearest neighbors algorithm to predict whether individuals are donors or not. Kircic et al. (2020) used logistic regression and a naive Bayes classifier to determine blood donation probabilities. 
Kauten et al. (2021) applied machine learning algorithms to model donor retention to support cost-effective outreach programs. They focused on predicting which donors will donate blood during a future time window, and applied the algorithm to operational data obtained from a large regional blood center in the US.

A few works addressed the BD prediction task in the Bayesian setting. Tavakol et al. (2016) proposed a log-normal hazard model with gamma correlated frailties to model the chance of donating blood. They considered data from an Iranian province and identified the types of donors with higher chances to donate. Mohammadi et al. (2016) implemented a bivariate zero-inflated Poisson regression to jointly model the number of blood donations and that of blood deferrals. They used non-informative priors, both in the presence and absence of covariates. Kassie and Birara (2021) adopted a Bayesian binary logistic regression approach to assess the impact of the covariates in blood donation, focusing on data from Northwestern Ethiopia.

Differently from these works, we consider recurrent events from the blood donation process. In the statistical literature, recurrent event data are tackled in one of two ways: (i) modelling the intensity function of the event counts \(\{N(t), t\ge 0\}\); or (ii) modelling the whole sequence of gap times between successive realizations of the recurrent events (Cook and Lawless 2007). The second approach is more appropriate when the events are relatively infrequent or when, after an event, individual renewal takes place in some way. The first approach is most suitable when individuals frequently experience the event of interest, as in our application, and the occurrence does not alter the process itself. The canonical framework for analyzing event counts is that of inhomogeneous Poisson processes. As an original contribution of this paper, we model the intensity function of the whole-blood donation event process as a function of the individual donors’ random frailties and their time-dependent covariates. We adopt a fully Bayesian approach, assuming a prior distribution for all unknown likelihood parameters.

3 Data and variables

We consider donation data of AVIS Milan from January 1st, 2010 to June 30th, 2018 concerning the recurrent donors, namely the donors who started to donate after January 1st, 2010 and donated at least two times in those years. Only whole-blood donations are included in the study, and time is measured in days. Furthermore, as the focus is on recurrences, the first donations corresponding to time \(t=0\) are removed.

We include two types of data: donations and donors’ personal data, and information about donors’ habits. Specifically, we include donor ID, sex, age at first donation, age at current donation, blood group (A, B, AB or 0) and Rh factor (POS or NEG), donation ID and time of the donation, indicators of smoking, drinking, physical activity and stress level, tea and coffee consumption, diet type, height (in \(\textrm{m}\)), weight (in \(\textrm{kg}\)), Body Mass Index (BMI), and health state values such as systolic and diastolic blood pressure (SBP and DBP, respectively), heart rate (HR) and hemoglobin (HGB). The health state values are measured by a physician and recorded at each donation. According to the Italian donation rules, any candidate donor who is going to donate for the first time must be between 18 and 60 years old, while the age limit is extended to 65 years for successive donations; however, physicians can allow a donor to keep donating until 70 years old if eligible after clinical evaluation of the age-related risks. The donor’s weight must be greater than \(50 \, \textrm{kg}\), and blood pressure, heart rate and hemoglobin values must lie within fixed limits. As an example, the hemoglobin range is \([13,18] \, \textrm{g}/\textrm{dl}\) for male donors and \([12,16] \, \textrm{g}/\textrm{dl}\) for female donors. The minimum gap time between two consecutive donations is 90 days for men and women in menopause, and 180 days for the other women. However, a small tolerance on these thresholds is possible after clinical evaluation. In particular, in our dataset, the minimum gap time is 85 days for men and 150 days for women. These rules are consistent with those that regulate blood donation processes in other countries, such as in Spain (Aldamiz-Echevarria and Aguirre-Garcia 2014). An observation time interval is associated with each donor, starting with the donor’s first donation (\(t=0\)) and ending on June 30th, 2018. Therefore, the duration of the observation time interval can differ across donors. The final dataset consists of 25,689 observations and 5,937 unique donors, including 4,005 men and 1,932 women.

Donation data show that most recurrent donors made only one donation yearly. Figure 1a displays the histogram of empirical yearly rates of donation, i.e., the total number of donations divided by the number of years under observation for each donor, while Fig. 1b reports the histogram of all donors’ gap times. The histogram in Fig. 1a is right-skewed, whereas that in Fig. 1b is bimodal due to the different donation rules between men and women.
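The two summaries plotted in Fig. 1 can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, a year is taken as 365.25 days, and only recurrences (not the first donation at \(t=0\)) are counted in the rate, which the text does not specify.

```python
def yearly_rate(recurrence_days, c_i, days_per_year=365.25):
    """Empirical yearly donation rate of one donor.

    recurrence_days: days (from the first donation, t = 0) of the recurrences.
    c_i: length in days of the donor's observation interval.
    Assumption: only recurrences are counted, and a year is 365.25 days.
    """
    return len(recurrence_days) / (c_i / days_per_year)


def gap_times(recurrence_days):
    """Gap times in days between consecutive donations, starting from t = 0."""
    times = [0] + sorted(recurrence_days)
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Hypothetical donor: recurrences at days 100 and 465, observed for 730.5 days.
rate = yearly_rate([100, 465], 730.5)
gaps = gap_times([100, 465])
```

Histograms such as those in Fig. 1 would then be built from these quantities over all donors, with the gap times on the log scale.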

Fig. 1 Histograms of yearly rates of donation (a) and gap times (b) on the log scale. The red vertical dashed lines denote the minimum waiting times for men (\(\log 85\)) and women (\(\log 150\)), respectively

The maximum number of recurrences is 29 for men and 14 for women. Figure 2 provides the barplots of normalized recurrences grouped by sex, which are defined as:

$$\begin{aligned} \dfrac{\text {number of men who perform } k \text { recurrences}}{\text {total number of men} \times 29}, \quad k=1, \ldots , 29 \end{aligned}$$

for men and

$$\begin{aligned} \dfrac{\text {number of women who perform } k \text { recurrences}}{\text {total number of women} \times 14}, \quad k=1, \ldots , 14 \end{aligned}$$

for women. The normalization allows for a fair comparison of female and male donation recurrences. Note that the 99% empirical quantile for women is 11, which coincides with the 88% empirical quantile for men. This means that, owing to the legal constraints on donation frequency, women donate much less than men.
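As a minimal sketch of the normalization defined above, the following function computes the normalized recurrences for one sex group. The function name and the toy input are hypothetical; the maximum number of recurrences is taken from the group itself (29 for men and 14 for women in the dataset).

```python
from collections import Counter


def normalized_recurrences(recurrence_counts):
    """Normalized recurrences for one sex group.

    recurrence_counts: list with the number of recurrences of each donor.
    Returns {k: (# donors with k recurrences) / (# donors * k_max)}
    for k = 1, ..., k_max, where k_max is the group maximum.
    """
    k_max = max(recurrence_counts)
    n_donors = len(recurrence_counts)
    frequency = Counter(recurrence_counts)
    return {k: frequency.get(k, 0) / (n_donors * k_max)
            for k in range(1, k_max + 1)}


# Toy group of four donors (illustrative values, not from the dataset).
example = normalized_recurrences([1, 1, 2, 4])
```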

Fig. 2 Barplot of normalized donation recurrences for male and female donors

The available data have missing values in some covariates. Because of the large share of missing values, the diet and stress factors have been discarded from the dataset. Missing values of the other variables were imputed using suitable frequentist methods, via the R package mice (van Buuren and Groothuis-Oudshoorn 2011). In particular, before imputation, there were 40% missing values for coffee or tea consumption, 14% for HGB, 9% for SBP, DBP and HR, 0.25% for BMI, and 0.19% for smoking, alcohol consumption and physical activity. The peak of missing data was recorded in 2013 and the minimum in 2018.

Among the factors that may affect blood donation recurrence, we investigate the covariates reported in Table 1. Note that information on smoking, alcohol consumption and level of physical activity is communicated to the physician by the donors themselves; as such, it could be inaccurate.

Table 1 Covariates included in the model

The mean age of the sample at the first donation is 32 years (with a large standard deviation \(\simeq 10\) years), male donors are about twice as many as female donors, the most common blood group is 0 (46.4%), and positive Rh factor is more frequent than negative Rh factor (about 87% of male donors and 84% of female donors are Rh-positive). There are more non-smokers (67.49%) than smokers and more alcohol non-consumers (69.82%) than consumers, and 76.03% of the donors practice physical activity. About 69% of donors have a healthy weight; underweight women are about twice as many as underweight men, whereas overweight men are four times as many as overweight women (Table 2).

Table 2 Sample frequencies of donors’ BMI, classified using the standard weight status categories (Centers for Disease Control and Prevention; www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html)

With regard to the health state variables (SBP, DBP, HR, HGB), the boxplots in Fig. 3 show that women have a lower value of hemoglobin than men, as expected, while SBP, DBP and HR are homogeneous across sexes.

Fig. 3 Boxplots of HGB (top left), SBP (top right), DBP (bottom left) and HR (bottom right) grouped by donor sex

Only HGB and DBP are considered time-dependent variables in the model. This assumption arises from preliminary analyses, which indicated that the trajectories of SBP and HR do not seem to vary over the observation period.

4 Methods

Let I be the number of donors. For each donor \(i=1, \ldots , I\), we consider a single recurrent event process starting at day \(T_{i0}=0\) of their first donation, where \(0<T_{i1}< T_{i2} < \cdots\) denote donor i’s days of subsequent donations. A counting process \(\{N_i(t)\}_{t\ge 0}\) with \(N_i(t) =\sum _{k=1}^{+\infty } \textbf{1}\{ T_{ik} \le t \}\) records the number of donations of donor i up to day t beyond the first donation. \(\{N_i(t)\}_{t\ge 0}\) is right-continuous, and mathematically defined by its event intensity function:

$$\begin{aligned} \lambda _i(t)= \lim _{\varDelta t\rightarrow 0} \frac{P \left( N_i(t+\varDelta t^-) - N_i(t)=1 \right) }{\varDelta t} \end{aligned}$$
(1)

where the difference \(N_i(t+\varDelta t^-) -N_i(t)\) represents the number of donations of donor i in the interval \([t,t+\varDelta t)\). Roughly speaking, \(\lambda _i(t)\) gives the instantaneous probability of an event occurring on day t. Because of (1), \(\{N_i(t)\}_{t\ge 0}\) is an inhomogeneous Poisson process (Cook and Lawless 2007, Chapter 1). Donor i is observed over the time interval \([0,c_i]\), in which he/she donates blood at days \(t_{i1}, \ldots , t_{i\, n_i}\) for a total of \(n_i\) donations, with \(0< T_{i1} = t_{i1}< \cdots < T_{i \, n_i} = t_{i \, n_i} \le c_i\). If \(t_{i n_i} < c_i\) then \(c_i\) is a censoring time.

Donor-specific information is fed into the model by including covariates and an individual random effect function into the associated individual multiplicative event intensity function \(\lambda _i(t) = \lambda _i(t|\varvec{x}_t)\), modeled as follows:

$$\begin{aligned} \lambda _i(t \, | \, \varvec{x}_t) = u_i(t) \, \textrm{e}^{\varvec{x}_{i}^{\prime }(t) \, \varvec{\beta }} \end{aligned}$$
(2)

where \(\varvec{x}_{i}^{\prime }(t) = \left( x_{i1}(t), \ldots , x_{ip}(t) \right)\) is the p-dimensional vector of covariates of donor i at time t, \(\varvec{\beta }\) is the vector of regression coefficients, and \(u_i(t)\) is the random effect specific to donor i. The symbol \(^\prime\) denotes transposition of a column vector. Our model belongs to the class of Cox’s proportional hazard regressions with random effects (Klein and Moeschberger 2003). Hence, Eq. (2) can be read as follows: the logarithm of the intensity \(\lambda _i(t)\) of the i-th donor is equal to the linear predictor \(\varvec{x}_i'(t)\varvec{\beta }\) plus the logarithm of the individual random effect, \(\log u_i(t)\). In this way, the effect of the j-th covariate is represented by the parameter \(\beta _j\). As in random effects models, the individual term \(u_i(t)\) represents the individual variability in the intensity that is not explained by the covariates.
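To make the multiplicative structure of Eq. (2) concrete, the sketch below evaluates the intensity for given values and checks that raising one covariate by one unit multiplies the intensity by \(\textrm{e}^{\beta _j}\). The function name and all numerical values are hypothetical and are not estimates from this study.

```python
import math


def intensity(u_t, x_t, beta):
    """Eq. (2): lambda_i(t | x) = u_i(t) * exp(x_i(t)' beta)."""
    linear_predictor = sum(b * x for b, x in zip(beta, x_t))
    return u_t * math.exp(linear_predictor)


# Hypothetical values, for illustration only.
beta = [-0.3, 0.2]
base = intensity(0.004, [1.0, 0.5], beta)
shifted = intensity(0.004, [1.0, 1.5], beta)  # second covariate raised by 1

# The ratio does not depend on u_i(t) or on the other covariates.
ratio = shifted / base
```

On the log scale the same relation is additive: \(\log \lambda _i(t) = \log u_i(t) + \varvec{x}_i'(t)\varvec{\beta }\).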

Some covariates in \(\varvec{x}_{i}(t)\) are time-dependent, measured at day t in conjunction with donations, while others are recorded at day \(t=0\) and considered constant over time. The time-varying covariates in \(\varvec{x}_{i}(t)\) are assumed to be step functions:

$$\begin{aligned} \varvec{x}_{i}(t) = \varvec{x}_i(t_{ij}) \qquad \text{ if } \; t_{ij} \le t < t_{i\,j+1}, \qquad j=0, 1, \ldots , n_i, \quad i = 1, \ldots , I \end{aligned}$$
(3)

with \(t_{i\, n_{i} +1} = c_i\) and \(t_{i0}=0\) for all i. The vector \(\varvec{x}_{i}(t)\) also includes interaction terms. In this application, the preliminary data analysis (Sect. 3) suggested including the interactions between sex and hemoglobin, sex and Rh factor, and sex and BMI. The covariates are listed in Table 1. Note that, because of the coding of the blood group, the intercept includes the regression parameter corresponding to GROUP_A. Thus, the total number of covariates including interactions is 17. Only HGB and DBP are considered time-dependent, as mentioned in Sect. 3, and all numerical covariates have been standardized.
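The step-function rule of Eq. (3) amounts to carrying the last measured value forward until the next donation. A minimal sketch, assuming the measurement days of a donor are stored in a sorted list starting at day 0 (the function name and the hemoglobin values are hypothetical):

```python
from bisect import bisect_right


def covariate_at(t, measurement_days, values):
    """Value of a time-varying covariate at day t, following Eq. (3).

    measurement_days: sorted donation days [t_i0 = 0, t_i1, ..., t_in].
    values: covariate value measured at each of those days.
    Returns values[j] for the index j such that t_ij <= t < t_i(j+1).
    """
    j = bisect_right(measurement_days, t) - 1
    if j < 0:
        raise ValueError("t precedes the first donation")
    return values[j]


# Hypothetical hemoglobin measurements (g/dl) at days 0, 100 and 250.
days = [0, 100, 250]
hgb = [14.0, 14.5, 13.8]
```

For instance, `covariate_at(99, days, hgb)` still returns the value measured at day 0, while `covariate_at(100, days, hgb)` switches to the value measured at day 100.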

To write the likelihood, we first specify the assumptions about the random effect function \(u_i(t)\) of donor i. We assume a piecewise constant \(u_i(t)\) on the time domain [0, c], with \(c = \max _{i = 1, \ldots , I} c_i\) and \(K=10\) intervals identified by the cut-points \(0 = a_{0}< a_{1}< \cdots < a_{K} =c\), i.e.:

$$\begin{aligned} u_i(t) = \sum _{k=1}^K u_{ik} \, \textbf{1}_{(a_{k-1},a_{k}]}(t). \end{aligned}$$
(4)

As donors cannot donate more than a certain number of times per year, we introduce an at-risk process \(\{Y_i(t)\}_{t\ge 0}\) in the model, which indicates whether donor i is at risk of experiencing a donation at day t, with:

$$\begin{aligned} Y_i(t)= \textbf{1}_{ \left( T_{i \, N_i(t^-)} +\varPhi _i\;,\; c_i\right) }(t). \end{aligned}$$
(5)
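A minimal sketch of the at-risk indicator in Eq. (5) is given below. It assumes that the donation days of a donor are stored in a sorted list that includes the first donation at day 0, so that the last donation strictly before t plays the role of \(T_{N_i(t^-)}\); the function name and the example values are hypothetical.

```python
from bisect import bisect_left


def at_risk(t, donation_days, phi, c_i):
    """At-risk indicator Y_i(t) of Eq. (5).

    donation_days: sorted donation days including the first one at day 0.
    phi: minimum gap time in days (85 for men, 150 for women in the dataset).
    c_i: censoring time of the donor.
    """
    n_before = bisect_left(donation_days, t)  # donations strictly before t
    if n_before == 0:
        return False  # t is not after the first donation
    last_donation = donation_days[n_before - 1]
    return last_donation + phi < t < c_i


# Hypothetical male donor: donations at days 0, 100 and 200, censored at 400.
history = [0, 100, 200]
```

With this history, the donor is at risk on day 90 and on day 100 (the day of the next donation itself), but not on day 150, which falls within 85 days of the donation at day 100.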

The quantity \(Y_i(t)\) forces the intensity to be 0 for the \(\varPhi _i\) days following every donation and after the censoring time \(c_i\), i.e., it imposes that a donor can donate neither within \(\varPhi _i\) days of the last donation nor after \(c_i\). \(\varPhi _i\) depends on the donor’s sex, and we fix \(\varPhi _i=85\) if i is a male donor and \(\varPhi _i=150\) if i is a female donor; see the discussion in Sect. 3. Furthermore, as this process only involves the administrative censoring time \(c_i\) and \(\varPhi _i\), it seems reasonable to assume that the at-risk indicator \(Y_i(t)\) and the observation at day t, given by \(N_i(t+\varDelta t^-) - N_i(t)\), are independent. The contribution \({\mathscr {L}}_i\) of donor i to the likelihood function is derived using Theorem 2.1 in Cook and Lawless (2007), which gives the conditional probability density of \(n_i\) events occurring at days \(t_{i1}< \cdots < t_{in_i}\) for each donor i who recurrently donates over the time interval \((0,c_i]\). Hence:

$$\begin{aligned} {\mathscr {L}}_i&= \left[ \prod _{j=1}^{n_i} \lambda _i(t_{ij}|\varvec{x}_i(t_{ij})) \right] \exp \left\{ -\int _{0}^{c_i} Y_i(s) \lambda _i(s|\varvec{x}_i(s)) ds \right\} \nonumber \\&= \left[ \prod _{j=1}^{n_i} u_i(t_{ij}) \right] \exp \left\{ \sum _{j=1}^{n_i} \varvec{x}^{\prime }_{i}(t_{ij})\varvec{\beta }- \int _{0}^{c_i}Y_i(s) u_i(s) \textrm{e}^{\varvec{x}^{\prime }_{i}(s)\varvec{\beta }} ds \right\} \nonumber \\&= \left[ \prod _{k=1}^{K}u_{ik}^{n_{ik}} \right] \exp \left\{ \sum _{j=1}^{n_i}\varvec{x}^{\prime }_{i}(t_{ij})\varvec{\beta }- \sum _{k=1}^{K}u_{ik}\int _{a_{k-1}}^{a_k}Y_i(s) \textrm{e}^{\varvec{x}^{\prime }_{i}(s)\varvec{\beta }} ds \right\} \end{aligned}$$
(6)

where \(n_{ik}\) is the number of donations experienced by donor i in the interval \((a_{k-1},a_k]\) and \(n_i=\sum _{k=1}^K n_{ik}\). Consequently, the likelihood of all I donors is:

$$\begin{aligned} \mathcal {L} = \prod _{i=1}^{I} {\mathscr {L}}_i = \left( \prod _{i=1}^{I} \prod _{k=1}^K u_{ik}^{n_{ik}} \right) \exp \Bigg \{\sum _{i=1}^I \sum _{j=1}^{n_i} \varvec{x}^{\prime }_{i}(t_{ij}) \varvec{\beta } - \sum _{i=1}^I \sum _{k=1}^K u_{ik}\int _{a_{k-1}}^{a_k} Y_i(s) \textrm{e}^{\varvec{x}^{\prime }_{i}(s) \varvec{\beta }} ds\Bigg \}. \end{aligned}$$
(7)
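Because the covariates are step functions (Eq. 3) and the frailty is piecewise constant (Eq. 4), the integral in Eq. (6) reduces to a finite sum of overlap lengths. The following sketch evaluates the log of \({\mathscr {L}}_i\) for one donor under these assumptions; the function name, the data layout and the numerical values are hypothetical, and this is not the Stan implementation used for the analysis.

```python
import math


def log_likelihood_donor(days, x, beta, u, cuts, phi, c_i):
    """Log of the contribution L_i in Eq. (6) for a single donor.

    days: sorted donation days [0, t_i1, ..., t_in], first donation included.
    x:    covariate vectors, x[j] measured at days[j] (step functions, Eq. 3).
    beta: regression coefficients.
    u:    frailty steps [u_i1, ..., u_iK] (Eq. 4).
    cuts: cut-points [a_0 = 0, a_1, ..., a_K].
    phi:  minimum gap time; c_i: censoring time (both enter via Y_i, Eq. 5).
    """
    def linear_predictor(x_j):
        return sum(b * v for b, v in zip(beta, x_j))

    def step_index(t):
        # 0-based index k such that cuts[k] < t <= cuts[k + 1]
        for k in range(len(u)):
            if cuts[k] < t <= cuts[k + 1]:
                return k
        raise ValueError("t lies outside the cut-points")

    log_lik = 0.0
    # Event terms: one factor u_i(t_ij) * exp(x_i(t_ij)' beta) per recurrence.
    for j in range(1, len(days)):
        log_lik += math.log(u[step_index(days[j])]) + linear_predictor(x[j])

    # Integral term: the donor is at risk on (t_ij + phi, t_i(j+1)), with c_i
    # closing the last segment, and covariates are constant on each segment.
    segment_ends = days[1:] + [c_i]
    for j, (start, end) in enumerate(zip(days, segment_ends)):
        risk_start = start + phi
        if risk_start >= end:
            continue
        weight = math.exp(linear_predictor(x[j]))
        for k in range(len(u)):
            overlap = min(end, cuts[k + 1]) - max(risk_start, cuts[k])
            if overlap > 0:
                log_lik -= u[k] * weight * overlap
    return log_lik


# Hypothetical donor: one recurrence at day 100, censored at day 300.
value = log_likelihood_donor(days=[0, 100], x=[[1.0], [2.0]], beta=[0.1],
                             u=[2.0, 0.5], cuts=[0, 200, 400],
                             phi=85, c_i=300)
```

Summing this quantity over all donors gives the logarithm of Eq. (7).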

As for the prior, we assume a priori independence between \(\varvec{\beta }\) and \(\{u_{ik}\}\) with:

$$\begin{aligned} \beta _1, \ldots , \beta _p {\mathop {\sim }\limits ^{\mathrm {i.i.d.}}}\mathcal {N}(0,\sigma ^2_0) \end{aligned}$$
(8)

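and a hierarchical gamma prior for the steps \(\{u_{ik}\}\), with hyperparameters \(\{\lambda _{0k}\}\), which takes into account the length of the time interval \(a_{k}-a_{k-1}\) (Christensen et al. 2010). Here and in the following, c denotes a positive hyperparameter, not the right endpoint \(c = \max _i c_i\) of the time domain introduced above: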

$$\begin{aligned} {\left\{ \begin{array}{ll} u_{ik} \, | \, \lambda _{0k}, c {\mathop {\sim }\limits ^{\mathrm {i.i.d.}}}\text{ Gamma } \, \left( \, c (a_{k}-a_{k-1}) \lambda _{0k}, \, c (a_{k}-a_{k-1}) \, \right) &{} \\ \lambda _{0k} {\mathop {\sim }\limits ^{\mathrm {i.i.d.}}}\text{ Gamma } \, (\delta ,\delta ) &{} \quad k=1,\ldots , K. \end{array}\right. } \end{aligned}$$
(9)

Equation (9) models a prior opinion of homogeneity among donors. Our choice corresponds to a discretized approximation of a very flexible non-parametric prior for the cumulative hazard, that is the gamma process prior (Kalbfleisch 1978). We have:

$$\begin{aligned} {{\,\textrm{E}\,}}(u_i(t)|\lambda _{01},\ldots ,\lambda _{0K},c) = \sum _{k=1}^K {{\,\textrm{E}\,}}(u_{ik}|\lambda _{0k},c) \textbf{1}_{(a_{k-1},a_k]}(t) = \sum _{k=1}^K \lambda _{0k} \textbf{1}_{(a_{k-1},a_k]}(t), \end{aligned}$$
(10)

so that:

$$\begin{aligned} {{\,\textrm{E}\,}}(u_i(t)) = \sum _{k=1}^K {{\,\textrm{E}\,}}(\lambda _{0k}) \, \textbf{1}_{(a_{k-1},a_k]}(t) = 1. \end{aligned}$$
(11)
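The hierarchical prior (9) and the unit prior mean in Eq. (11) can be checked by simulation. The sketch below uses only the Python standard library; note that `random.gammavariate` takes a shape and a scale, so rates are inverted. The hyperparameter values match those reported in Sect. 5, whereas the interval length of 300 days, the number of intervals and the number of replications are hypothetical choices made for illustration.

```python
import random


def sample_prior(delta, c_hyper, interval_lengths, n_donors, rng):
    """One joint draw of ({lambda_0k}, {u_ik}) from the prior in Eq. (9)."""
    # lambda_0k ~ Gamma(shape = delta, rate = delta), shared by all donors.
    lambdas = [rng.gammavariate(delta, 1.0 / delta) for _ in interval_lengths]
    steps = []
    for _ in range(n_donors):
        row = []
        for lam, length in zip(lambdas, interval_lengths):
            rate = c_hyper * length
            # u_ik | lambda_0k ~ Gamma(shape = rate * lambda_0k, rate = rate).
            row.append(rng.gammavariate(rate * lam, 1.0 / rate))
        steps.append(row)
    return lambdas, steps


rng = random.Random(0)
lengths = [300.0, 300.0, 300.0]  # hypothetical interval lengths in days
total, count = 0.0, 0
for _ in range(5000):
    _, steps = sample_prior(2.0, 0.01, lengths, 2, rng)
    for row in steps:
        total += sum(row)
        count += len(row)
prior_mean = total / count  # Monte Carlo estimate of E(u_ik)
```

Up to Monte Carlo error, `prior_mean` should be close to 1, in agreement with Eq. (11).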

The hyperparameters c and \(\delta\) quantify the uncertainty on the steps \(u_{ik}\) and the pairwise correlation between frailties. In particular, they measure how widely the prior of the steps \(\{u_{ik}\}\) spreads around its mean \({{\,\textrm{E}\,}}(u_{i}(t))\) because:

$$\begin{aligned} {{\,\textrm{Var}\,}}\left( u_i(t) \, | \, \lambda _{01},\ldots ,\lambda _{0K},c \right) = \sum _{k=1}^K \frac{\lambda _{0k}}{c(a_{k}-a_{k-1})} \, \textbf{1}_{(a_{k-1},a_k]}(t) \end{aligned}$$
(12)

and hence:

$$\begin{aligned} {{\,\textrm{Var}\,}}\left( u_{ik} \right) = \frac{1}{c(a_{k}-a_{k-1})} + \frac{1}{\delta }. \end{aligned}$$

In addition, the correlation \(\rho (u_{i}(s), u_{j}(t))\) between \(u_{i}(s)\) and \(u_{j}(t)\) for two distinct donors \(i \ne j\) changes with c and \(\delta\) as:

$$\begin{aligned} \rho (u_{i}(s), u_{j}(t)) = \frac{c(a_{k}-a_{k-1})}{c(a_{k}-a_{k-1}) + \delta } \end{aligned}$$

for all \(s,t \in (a_{k-1},a_k]\) and \(k=1,\ldots ,K\), while \(\rho (u_{i}(s), u_{j}(t))=0\) otherwise.
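The marginal variance and the correlation above can be recovered from prior (9) through the laws of total variance and total covariance. As a sketch, write \(\varDelta _k = a_k - a_{k-1}\) (a shorthand introduced here), recall that a Gamma distribution with shape \(\alpha\) and rate \(\beta\) has mean \(\alpha /\beta\) and variance \(\alpha /\beta ^2\), and assume, as the i.i.d. statement in (9) suggests, that \(u_{ik}\) and \(u_{jk}\) are conditionally independent given \(\lambda _{0k}\) for \(i \ne j\):

$$\begin{aligned} {{\,\textrm{Var}\,}}(u_{ik})&= {{\,\textrm{E}\,}}\left[ {{\,\textrm{Var}\,}}(u_{ik}|\lambda _{0k},c)\right] + {{\,\textrm{Var}\,}}\left[ {{\,\textrm{E}\,}}(u_{ik}|\lambda _{0k},c)\right] = \frac{{{\,\textrm{E}\,}}(\lambda _{0k})}{c\varDelta _k} + {{\,\textrm{Var}\,}}(\lambda _{0k}) = \frac{1}{c\varDelta _k} + \frac{1}{\delta }, \\ {{\,\textrm{Cov}\,}}(u_{ik},u_{jk})&= {{\,\textrm{Var}\,}}(\lambda _{0k}) = \frac{1}{\delta }, \qquad \rho (u_{ik},u_{jk}) = \frac{1/\delta }{1/(c\varDelta _k) + 1/\delta } = \frac{c\varDelta _k}{c\varDelta _k+\delta }. \end{aligned}$$

Steps in different intervals depend on different, independent \(\lambda _{0k}\), which is why the correlation vanishes otherwise.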

Equation (10) provides an immediate interpretation of the step function \(\lambda _{0}(t):= \sum _{k=1}^K \lambda _{0k} \textbf{1}_{(a_{k-1},a_k]}(t)\) as the baseline intensity function. In the Bayesian framework, the hyperparameters \(\{\lambda _{01},\ldots ,\lambda _{0K}\}\) form a piecewise constant baseline intensity function. As a consequence, the posterior mean \({\widehat{\lambda }}_0(t)\) of \(\lambda _0(t)\), given by

$$\begin{aligned} {\widehat{\lambda }}_0(t) = \sum _{k=1}^K {{\,\textrm{E}\,}}(\lambda _{0k}|\text { data}) \textbf{1}_{(a_{k-1},a_k]}(t), \end{aligned}$$

is a Bayesian estimate of the baseline intensity function \(\lambda _0(t)\). The cut-points of the step functions \(u_{i}(t)\)’s over the time window have been taken equally spaced. The marginal prior (9) turns out to be a convenient choice, since in this case \(\sum _{k=1}^K \lambda _{0k} \textbf{1}_{(a_{k-1},a_k]}(t)\) can be interpreted as the centering hazard for all donors. Finally, in accordance with the scale-invariance of the gamma prior of this parameterization, the parameter \(v_{ik}\), defined by

$$\begin{aligned} v_{ik}= \dfrac{u_{ik}}{\lambda _{0k}}, \end{aligned}$$
(13)

represents the specific random effect of donor i at time \(t\in (a_{k-1},a_{k}]\) for \(k = 1, \ldots , K\). The terms \(\{v_{ik}\}_k\) have a multiplicative effect on the intensity function of donor i, so that \(v_{ik}>1\) indicates more propensity to experience a donation and vice versa, with the same covariate values.

5 Posterior inference

We compute the posterior distribution of the Bayesian model (7)–(9) using a Markov chain Monte Carlo (MCMC) algorithm. In particular, we employ the Hamiltonian Monte Carlo (HMC) through the software platform Stan (Stan Development Team 2020). The values of the hyperparameters in (8)–(9) are set to \(\delta =2\), \(c=0.01\) and \(\sigma _0^2=10^4\). We run two HMC chains of 5000 iterations, each one with a warmup of 3000 iterations, for a final sample of 4000 in total; MCMC convergence diagnostics available in the R package CODA show that convergence holds for all parameters (Plummer et al. 2006).

The posterior densities of the regression parameters of SBP and HR in the Bayesian model with all 17 covariates, including interactions, are highly concentrated around 0. Therefore, we have discarded SBP and HR from the model and refitted it. Figure 4 shows the marginal posterior densities of the regression parameters of the 15 remaining covariates, in red if the factor increases the intensity of recurrence and in blue if it decreases it.

Fig. 4 Marginal posterior densities of the regression parameters and interactions in \(\varvec{\beta }\) after discarding SBP and HR. Blue and red colors discriminate between negative and positive effects of the factors on the intensity function, respectively

A posteriori, \(\beta _{\text {SEX}}\) is concentrated on negative values, suggesting that male donors have a smaller recurrence intensity than female donors. Individuals with blood groups O, B, or AB exhibit a lower recurrence intensity compared to the reference blood group A. Similarly, donors with \(\text {RH}^+\) factor have a smaller recurrence intensity than donors with the reference \(\text {RH}^-\). Since the \(\text {RH}^-\) factor is much rarer than \(\text {RH}^+\), both in the dataset and in the Italian population, this suggests that donors with negative Rh factor are more engaged and donate regularly. Moreover, male donors with \(\text {RH}^-\) factor donate more frequently than female donors with \(\text {RH}^-\). Smokers and drinkers tend to donate less than non-smokers and non-drinkers, while non-active donors (PHYSICAL\(\_\)ACTIVITY = 0) are more likely to donate. Finally, HGB, DBP, age and BMI have a positive effect on the recurrence intensity. However, note that the effects of HGB and BMI are stronger for female donors than for male donors, since the marginal posteriors of \(\beta _{\text {SEX*HGB}}\) and \(\beta _{\text {SEX*BMI}}\) are concentrated on negative values.

Table 3 reports posterior summaries of the baseline parameters \(\lambda _{0k}\). The estimation of the piecewise baseline intensity function \(\lambda _0(t)\) suggests that donors are more likely to donate in their first year and that this propensity tends to decrease over time.

Table 3 Posterior means, standard deviations and quantiles of the baseline parameters \(\lambda _{0k}\)

Figure 5 displays the posterior means of the individual random effect functions \(\{v_{ik}\}_{k = 1, \ldots , K}\) defined in Eq. (13) for 10 randomly selected male and 10 randomly selected female donors. These plots show the random heterogeneity of the individuals in the sample, and express the variability that the other parameters of the model cannot quantify. From the figure we also see that the values of the frailties are typically greater for men than for women. This is confirmed when plotting the posterior means of further individual frailties, which are not included here. The result is expected, because Italian donation rules allow men to donate more often than women.

Fig. 5 Posterior means of the random effect functions \(v_{i1}, \ldots , v_{i 10}\) for ten randomly selected male (left) and female (right) donors

We have reported above the posterior inference when hyperparameter c is set to 0.01. Since this hyperparameter controls both the variance of each \(u_{ik}\) and the correlation between \(u_{ik}\) and \(u_{jk}\), we have also fitted the model for \(c=1\) and \(c=2\). The posterior estimates turned out to be robust and, for this reason, they are not reported here.

6 Profiling and prediction

Any blood collection center must deal with donor profiling and donation prediction.

Profiling is a key task to carry out effective campaigns to acquire further donors. Indeed, donor recruitment campaigns should be directed towards individuals whose characteristics could guarantee a high frequency of donation and continuity over the years. In this light, the proposed approach allows for predicting future donation patterns of each possible donor profile. More precisely, we can compute the posterior predictive distribution of potential novel donors, identified by representative values of the vector of covariates. As a result, the collection center will be able to appropriately choose the target of a recruitment campaign, directing it towards the most promising profiles, e.g., in high schools (very young donors) rather than in companies (older donors) or vice versa.

Prediction refers to the estimation of the blood units that will be donated in total and/or for each blood type over a given time horizon. This is required for the internal organization, to size the capacity and personnel needed on any day of the time horizon, and to predict the achievement of production targets for each blood type. In particular, for the internal organization, the time horizon is usually one week and there is no need to distinguish between different blood types. Instead, the production targets are set for a longer time horizon (e.g., one month) and for each blood type individually. In our framework, information can be provided in terms of the average number of donors over the time horizon, because each donation corresponds to a single blood unit and a donor can donate at most once within a week or a month.

6.1 Profiling

The aim of profiling is to identify the donor profiles that donate more frequently, discovering the characteristics that yield significantly shorter waiting times before the next donation. With this aim, we compute the posterior predictive probability that a new donor l, identified by covariates \({\varvec{x}}_l\), will make the second donation at least t days after the first one. This probability can be computed as the MCMC mean of \(P(W_{l1}>t|T_{l0}=0, \text {parameters})\), where we assume that the first donation of the new donor l is made at time \(T_{l0}=0\), so that \(W_{l1}=T_{l1}-T_{l0}\) is the waiting time of the first recurrence (corresponding to the second donation). The conditional probability can be derived from Corollary 1 of Chapter 1 in Cook and Lawless (2007). We let t vary in the first three months in which donor l is allowed to donate, i.e., \(t \in (\varPhi _l+1, \varPhi _l+90)\) where \(\varPhi _l = 85\) days for men and 150 for women. In this way, only \(u_{l1}\), which covers the first 310 days of the donation process, is needed. We obtain:

$$\begin{aligned} P(W_{l1}>t|T_{l0}=0, \text {parameters})&= \exp \left\{ -\int _{0}^{t} \lambda _l(s) ds \right\} = \exp \left\{ -\int _{0}^{t} u_{l1} \textrm{e}^{\varvec{x}^{\prime }_{l} (s) \varvec{\beta }} Y_l(s) ds \right\} \\&= {\left\{ \begin{array}{ll} 1 &{} \text { if } t \le \varPhi _l \\ \exp \left\{ - u_{l1} \textrm{e}^{\varvec{x}^{\prime }_{l} \varvec{\beta }} (t-\varPhi _l) \right\} &{} \text { if } t > \varPhi _l \end{array}\right. } \end{aligned}$$

where we have assumed that the time-varying covariate vector \(\varvec{x}_{l}(s)\) is constant between the two successive donations considered, and we have denoted this value as \(\varvec{x}_{l}\).
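The closed form above lends itself to a simple Monte Carlo average. The following Python sketch evaluates the survival probability for each MCMC draw and summarizes it by its mean and 95% credible bounds. It assumes that draws of \(u_{l1}\) for the new donor and of \(\varvec{\beta }\) are already available as arrays; the function name, argument layout and toy values are hypothetical and serve only as an illustration.

```python
import numpy as np

def survival_prob(t, phi, u1_draws, beta_draws, x):
    """MCMC summary of P(W_l1 > t | T_l0 = 0) for a new donor.

    t:          days elapsed since the first donation
    phi:        deferral period in days (85 for men, 150 for women)
    u1_draws:   shape (S,), draws of u_l1 (assumed to be supplied)
    beta_draws: shape (S, p), draws of the regression parameters
    x:          shape (p,), covariates, held constant over the period
    Returns (posterior mean, 2.5% quantile, 97.5% quantile).
    """
    u1 = np.asarray(u1_draws, dtype=float)
    beta = np.asarray(beta_draws, dtype=float)
    x = np.asarray(x, dtype=float)
    # The donor is at risk only after the deferral period has elapsed.
    at_risk_days = max(t - phi, 0.0)
    surv = np.exp(-u1 * np.exp(beta @ x) * at_risk_days)
    lower, upper = np.quantile(surv, [0.025, 0.975])
    return surv.mean(), lower, upper

# Toy inputs (made up): one covariate with a zero coefficient.
inside_deferral = survival_prob(80, 85, [0.1, 0.2], [[0.0], [0.0]], [1.0])
ten_days_at_risk = survival_prob(95, 85, [0.1], [[0.0]], [1.0])
```

With these toy inputs the probability is exactly 1 while the donor is still deferred, and it equals \(\textrm{e}^{-1}\) after ten days at risk with \(u_{l1}=0.1\).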

A few relevant profiles have been selected and reported in Table 4 to provide a concrete example. Profile 0 refers to a healthy young man with an age at first donation equal to 25 years. Profile 1 describes a healthy young woman, 20 years old, with the rest of the covariates as in Profile 0. Profiles 2 and 3 refer to healthy middle-aged donors (a male donor and a female donor). Profiles 4–7 refer to middle-aged men and women with unhealthy lifestyle habits. Finally, Profiles 8 and 9 correspond to Profiles 0 and 1, but with the less common blood type AB and negative Rh factor. HGB and DBP of each profile are set to the sample mean of the same-sex and age-matched donor subset. Finally, values of all covariates have been kept constant during the analyzed 90-day period in each profile.

Table 4 Selected profiles of potential donors

Figure 6 shows the posterior mean and 95% posterior credible bounds of \(P(W_{l1}>t|T_{l0}=0, \varvec{x}_{l}, \text {parameters})\), for 10 simulated new donors (\(l=0,1,\ldots ,9\)). The faster these curves decrease, the more likely the associated donors are to donate as soon as they are allowed, which indicates more productive donors.

Fig. 6 Bayesian posterior prediction of \(P(W_1>t|T_{0}=0, \varvec{x}, \text {parameters})\) in the first 90 days in which each profile is allowed to donate after his/her first donation. 95% credible bounds are added as dashed lines

Profile 9 is the most productive; it refers to a woman who starts donating at 20 years of age, with a healthy weight, an active life, no smoking, no alcohol and the rare blood type AB with negative Rh factor. This agrees with the comments made in Sect. 3 on the exploratory data analysis, namely that women donate more often than men, subject to legal obligations. Furthermore, the posterior analysis (Sect. 5) showed that \(\beta _{\text {SEX}}\) is concentrated on negative values, providing evidence that women have a higher recurrence intensity than men. The next three best profiles (Profiles 3, 6 and 7) all represent middle-aged female donors. The posterior predictive probability that they will donate after at least t days from the first donation declines very rapidly in the first three weeks after \(\varPhi _l\) and then approaches zero. In contrast, Profiles 0 and 8, both referring to young men, show the lowest propensity to donate, as the associated posterior predictive survival probability decreases more slowly. The trajectories of male donors are clearly grouped by age: one group corresponds to middle-aged Profiles 2, 4 and 5, and the other to young Profiles 0 and 8. As for blood group and Rh factor, we observe from Fig. 6 that the effect of GROUP_AB together with negative Rh factor is significant for women only; in particular, note how much Profiles 1 and 9, which correspond to the same female profile but with different blood group and Rh factor, differ. At the end of the three months, all profiles but Profile 2 show a posterior predictive survival probability that approaches zero, meaning that they are highly likely to have made the second donation within the analyzed time window. To conclude, among these profiles, we suggest that AVIS Milan target new recruitment campaigns at middle-aged men and young women.

6.2 Prediction

The estimate of the daily production of blood units can be obtained from the posterior inference, by combining the predictions of all donors who might donate within a given horizon. Since our model does not account for the arrival of novel donors but only for the recurrent donations process, we can only estimate the number of blood units from the donors already in the sample. This represents a lower bound for the estimate of the total number of blood units that can be collected.

The number of donations from the donors in the dataset that will take place in the next \(\tau\) days after June \(30\)th, 2018 (last day of observation in the database) can be expressed as:

$$\begin{aligned} \ N(\tau ) = \sum _{i=1}^I \textbf{1}\left( N_i(c_i+\tau )-N_i(c_i)=1| \varvec{\beta }, \varvec{u}, {\varvec{\lambda }}_0 \right) , \end{aligned}$$
(14)

with (conditional) expected value:

$$\begin{aligned} \mu (\tau ) = {{\,\textrm{E}\,}}\left( N(\tau )| \varvec{\beta }, \varvec{u}, {\varvec{\lambda }}_0 \right) = \sum _{i=1}^I P \left( N_i(c_i+\tau )-N_i(c_i)=1 | \varvec{\beta }, \varvec{u}, {\varvec{\lambda }}_0 \right) . \end{aligned}$$
(15)

A Bayesian estimator \({\widehat{\mu }}(\tau )\) of \(\mu (\tau )\) is given by its posterior mean:

$$\begin{aligned} {\widehat{\mu }}(\tau ) = {{\,\textrm{E}\,}}\left( \mu (\tau ) | \text { data} \right) = \sum _{i=1}^I {{\,\textrm{E}\,}}\left( P \left( N_i(c_i+\tau )-N_i(c_i)=1 | \varvec{\beta }, \varvec{u}, {\varvec{\lambda }}_0 \right) | \text { data} \right) . \end{aligned}$$
(16)

This conditional expected value \(\mu (\tau )\) should be at least equal to the production target. Conversely, an average number of donations higher than the target is not considered a problem, because extra units can be stored or distributed to neighboring facilities. In addition, potential recurrent donors could be asked to delay their next donation to periods of underproduction.

The at-risk indicator \(Y_i(s)\) of donor \(i=1, \ldots , I\) at day \(s \in ( c_i, c_i+\tau ]\) turns out to be:

$$\begin{aligned} Y_i(s) = {\left\{ \begin{array}{ll} 1 \qquad \text {if } s-t_{i n_i}\ge \varPhi _i \\ 0 \qquad \text {otherwise} \end{array}\right. } \end{aligned}$$
(17)

where \(t_{i n_i}\) is the day of the last donation for donor i, and \(\varPhi _i\) is again equal to 85 for men and 150 for women. In the light of those remarks, it follows from Eq. (6) that:

$$\begin{aligned} P \left( N_i(c_i+\tau )-N_i(c_i) =1 | \varvec{\beta }, \varvec{u}, {\varvec{\lambda }}_0 \right) = 1 - \exp \left\{ - \int _{c_i}^{c_i+\tau } u_i(s) \textrm{e}^{\varvec{x}^{\prime }_{i} \varvec{\beta }} Y_i(s) ds \right\} \end{aligned}$$
(18)

with Bayesian estimator \({\widehat{p}}_{i}:= {{\,\textrm{E}\,}}\left( P( N_i(c_i+\tau )-N_i(c_i)=1) |\text {data} \right)\) given by:

$$\begin{aligned} {\widehat{p}}_{i} = {\left\{ \begin{array}{ll} 1 - {{\,\textrm{E}\,}}\left( \exp \left\{ -\int _{c_i}^{c_i+\tau } u_i(s) \textrm{e}^{\varvec{x}^{\prime }_{i}\varvec{\beta }} ds \right\} | \, \text {data} \right) &{} \text{ if } t_{i n_i}+\varPhi _i \le c_i \\ 1 - {{\,\textrm{E}\,}}\left( \exp \left\{ -\int _{t_{i n_i}+\varPhi _i}^{c_i+\tau } u_i(s) \textrm{e}^{\varvec{x}^{\prime }_{i}\varvec{\beta }} ds \right\} | \,\text {data} \right) &{} \text{ if } c_i\le t_{i n_i}+\varPhi _i\le c_i+\tau \\ 0 &{} \text{ if } t_{i n_i}+\varPhi _i \ge c_i+\tau . \end{array}\right. } \end{aligned}$$
(19)
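To illustrate how Eqs. (16) and (19) might be evaluated from MCMC output, the Python sketch below integrates the piecewise-constant \(u_i(s)\) over the part of the horizon in which the donor is at risk and averages over the draws. All function names, the dictionary layout of a donor record, the cut-points and the toy numbers are assumptions introduced for illustration; times are taken on each donor's own time scale, which is also an assumption.

```python
import numpy as np

def integrated_u(u_draws, cuts, start, end):
    """Integral of the step function u_i(s) over (start, end], per draw.

    u_draws: shape (S, K), draws of u_i1, ..., u_iK
    cuts:    shape (K+1,), increasing cut-points a_0 < ... < a_K
    """
    cuts = np.asarray(cuts, dtype=float)
    # Overlap length between (start, end] and each interval (a_{k-1}, a_k].
    overlap = np.minimum(cuts[1:], end) - np.maximum(cuts[:-1], start)
    overlap = np.clip(overlap, 0.0, None)
    return np.asarray(u_draws, dtype=float) @ overlap

def p_hat(u_draws, beta_draws, x, cuts, c, tau, t_last, phi):
    """MCMC estimate of the donation probability in Eq. (19) for one donor."""
    start = max(c, t_last + phi)  # first time at risk within the horizon
    end = c + tau
    if start >= end:              # deferral covers the whole horizon
        return 0.0
    lin = np.exp(np.asarray(beta_draws, float) @ np.asarray(x, float))
    cum = integrated_u(u_draws, cuts, start, end) * lin
    return float(np.mean(1.0 - np.exp(-cum)))

def mu_hat(donors, beta_draws, cuts, tau):
    """Expected number of donations in the next tau days, as in Eq. (16)."""
    return sum(
        p_hat(d["u"], beta_draws, d["x"], cuts, d["c"], tau, d["t_last"], d["phi"])
        for d in donors
    )

# Toy data (made up): two donors, one draw, two baseline intervals.
cuts = [0.0, 310.0, 620.0]
beta = [[0.0]]
donors = [
    {"u": [[0.01, 0.02]], "x": [1.0], "c": 300.0, "t_last": 100.0, "phi": 85.0},
    {"u": [[0.01, 0.02]], "x": [1.0], "c": 300.0, "t_last": 290.0, "phi": 150.0},
]
estimate = mu_hat(donors, beta, cuts, tau=30)
```

In the toy data the first donor is at risk for the whole 30-day horizon, which straddles a cut-point, while the second donor is still deferred and contributes zero.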

Let us now focus on a time horizon of either a week (\(\tau =7\)) or a month (\(\tau =30\)) after June \(30\)th, 2018, both of which are of practical importance. Figure 7a reports the estimates of \(\mu (7)\) and \(\mu (30)\), which are mutually consistent, since the estimate of \(\mu (30)\) is approximately four times that of \(\mu (7)\). Figure 7b shows the estimate of \(\mu (30)\) per blood group, in which the proportion between groups is consistent, with the largest number of blood units for groups O and A, followed by group B. For comparison with AVIS Milan practice, the observed monthly number of blood units produced in 2018 was approximately 1500. However, note that this number includes not only the donations from the donors in our dataset, but also those from non-recurrent donors and from recurrent donors already active before January \(1\)st, 2010. In particular, the donations from the donors in our dataset (for which we make a prediction) represent about 60% of all donations recorded by AVIS Milan in the time window we focus on, corresponding to about 30 donations recorded per day. Therefore, our estimates of \(\mu (7)\) and \(\mu (30)\) are coherent with the observed production.
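As a rough worked check of these orders of magnitude, using only the approximate figures quoted above and assuming a 30-day month and that the 60% share also applies to 2018:

$$\begin{aligned} 0.60 \times 1500 = 900 \text { donations per month}, \qquad \frac{900}{30} = 30 \text { donations per day}, \qquad 7 \times 30 = 210 \text { donations per week}. \end{aligned}$$

These are back-of-the-envelope benchmarks derived from the quoted figures, not model outputs.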

Fig. 7 Estimated number of donors \(\mu (\tau )\) for the next week \(\tau =7\) and the next month \(\tau =30\), after June \(30\)th, 2018 (a); estimated number of donors \(\mu (30)\) grouped by blood group (b)

7 Conclusions

Uncertainty is a major issue when managing health care facilities and affects the performance of the service. This is particularly true when considering blood collection centers, which merge the features of a service provider and those of a production system. The ability to predict the times of donations is a key point to guarantee an adequate production of blood units and at the same time to properly manage the resource at the blood collection center. We focus our attention on two points in particular, profiling and prediction, and we propose a statistical prediction model that quantitatively supports these tasks.

As for profiling, the proposed model can make the process of acquiring further donors more efficient, directing interventions towards donors who, once enrolled, will donate as often as the law allows. Our model indicates the most productive donor profiles among a set of possible alternatives. As for prediction, our model is able to predict the number of donations in the coming days. However, our approach considers only recurrent donors who entered after the initial date of the time window, and provides estimates only for them. Therefore, when evaluating the number of donations over a time horizon, those from non-recurrent donors and from recurrent donors already active before the time window should be added. To predict the number of non-recurrent donors, we could consider a model describing the arrivals of novel donors; however, according to AVIS Milan staff, in recent years the number of donations from novel donors seems negligible with respect to the others. Note that we have excluded recurrent donors already active before the time window since their inclusion would require left-censoring in the likelihood in addition to right-censoring. This extension is left for future work.

General indications can also be drawn for the case of AVIS Milan, which may result in useful qualitative knowledge, supported by quantitative analyses, to be provided to the staff. In particular, our analysis highlights a decreasing trend of the baseline intensity function. It also identifies the individual features (sex, smoking habits, alcohol consumption, physical activity, BMI, Rh factor, blood group, age at first donation, hemoglobin and diastolic blood pressure) that most influence the intensity function and, hence, determine donors’ personal propensity to donate. Also, the interactions of sex with hemoglobin, Rh factor and BMI are found to be significant in differentiating donors’ behavior.

Our method can be immediately applied to the AVIS Milan case and also to other blood collection centers. In the latter case, it is enough to compute the posterior distribution of the parameters, given the different data. Then, the results of similar analyses on other databases can provide helpful information for centers. In conclusion, we have proposed a model that has proven to be an effective solution for profiling and prediction needs, and that can be immediately used in any blood collection center.