Background

Many statistical methods have been reported for handling the missing data that are frequently observed in epidemiologic population-based studies. Missing data may concern baseline predictors but, more frequently, follow-up data, including the study outcome, because of study dropouts. Excluding these cases from the analysis may result in so-called attrition bias [1]. Indeed, such a complete case analysis is only valid if the outcome is missing completely at random (MCAR), so that conclusions about the population of followed subjects also apply to the population of those who dropped out [2]. Simple imputation methods, such as last observation carried forward or the inclusion of missing data indicators in regression models, may also lead to overly precise or even biased estimates of the relationships of interest when the data are missing at random (MAR) or even MCAR. To avoid bias and to incorporate the imprecision due to imputation rather than observation, a widely applicable approach is multiple imputation [3]. Its value in handling missing values in baseline data [4] or longitudinal data in epidemiologic contexts [5–7] has been shown, although multiple imputation is still underused and underreported [8–10]. Notably, although it is recommended to include the outcome in addition to the covariates in the imputation model [11, 12], this is not common practice in epidemiological studies.

We focused on multiple imputation by chained equations (MICE), which has been advocated and developed as the most appropriate, flexible and practical approach to handle missing data, including missing outcomes, in complex surveys under MAR mechanisms [11], though other techniques have also been proposed [13]. When the MAR assumption can be supported by the data collection, MICE provides asymptotically unbiased estimates and standard errors, and is asymptotically efficient when the imputation models are correctly specified. Notably, the distribution of each variable with missing values should be properly modeled, and the imputation model should include at least all variables of the analysis model, namely all the predictors as well as the outcome [14, 15].

Polyphenols are a complex family of natural plant-based molecules occurring in most plant foods (such as fruits, cereals, vegetables, chocolate and wine). They comprise more than 500 identified compounds in the human diet, ranging from low-molecular-weight phenolic acids to highly polymerized proanthocyanidins [16, 17], with varying levels of bioavailability and biological properties [16, 18, 19]. Many studies (particularly laboratory experiments using animal models or cultured human cell lines) support an antioxidant role of polyphenols. However, very few have explored the association between polyphenols and oxidative stress in large samples from the general population.

This study aims to illustrate the value of handling missing values with MICE, using as an example the study of the long-term relationship between the intake of dietary polyphenol subclasses and thiobarbituric-acid-reactive substances (TBARS), a marker of oxidative stress [20]. This original analysis was performed using data from the SU.VI.MAX (SUpplémentation en VItamines et Minéraux AntioXydants) study, a nutritional randomized primary prevention trial that aimed to study the association between antioxidant vitamin supplementation at nutritional doses and health events (1994–2005), and from its follow-up cohort study (2008–2009), the SU.VI.MAX2 study. As this was an ancillary analysis, some data were incomplete, and we decided to illustrate the value of multiple imputation techniques in the analysis of such data. We therefore studied how the results were affected by the use of complete case analysis or multiple imputation models, and by the choice of the subsample to which multiple imputation was applied (including or excluding subjects without an available outcome).

Methods

Motivating example

We used data from the SU.VI.MAX study, a randomized, double-blind, placebo-controlled, primary prevention trial that was initially designed to evaluate the effect of daily supplementation with antioxidant vitamins (E, C, and β-carotene) and minerals (selenium and zinc) at nutritional doses on the incidence of cancer and ischemic heart disease (Trial Registration clinicaltrials.gov Identifier: NCT00272428) [21]. Subjects were included between October 1994 and May 1995 for a planned 8-year supplementation. Volunteers had to fulfil the following eligibility criteria: (1) be 35–60 years old for women and 45–60 years old for men; (2) declare themselves free of any disease that might compromise participation; (3) not be taking supplements containing the vitamins or minerals provided in the trial; (4) accept the protocol constraints, especially the possibility of receiving a placebo; and (5) express no ambiguous motivations or obsessional behavior concerning diet and health [22]. At inclusion in SU.VI.MAX, all 12,741 participants completed questionnaires on socio-demographics, smoking status, physical activity and diet. Dietary data were collected through 24-h dietary records, and polyphenol intakes were computed for subjects with at least six 24-h records available in the first two years of follow-up, with at least three in the summer and three in the winter, to account for seasonal variation in intakes. The polyphenol intake of each participant was estimated from the 24-h dietary records using the Phenol-Explorer database. This database, first released in 2009, is the first comprehensive electronic database on polyphenol contents in foods and beverages, and provides data on a total of 502 different polyphenols from 452 different foods [23].
All data are available on an open-access website, which makes it possible to identify the polyphenol content of each food (type and amount of each individual polyphenol). Overall, seventeen dietary subclasses of polyphenol intake (namely anthocyanins, chalcones, dihydrochalcones, dihydroflavonols, proanthocyanidins, theaflavins, catechins, flavanones, flavones, flavonols, isoflavonoids, hydroxybenzoic acids, hydroxycinnamic acids, phenolic acids other than hydroxybenzoic or hydroxycinnamic acids, stilbenes, lignans, and other polyphenols) were identified [24].

At the end of the supplementation, 4,129 subjects with available dietary intake data who agreed to participate were included in the optional SU.VI.MAX 2 study [25]. Among them, 2,116 (51.2%) subjects, selected according to geographical criteria, had thiobarbituric-acid-reactive substances (TBARS) measured 12 years after the SU.VI.MAX randomization visit. Because of operational and logistical constraints (TBARS measurement requires specialized laboratories), this plasma oxidative biomarker was only measured in a subsample of about one half of the subjects.

Epidemiologists were interested in exploring the associations between subclasses of dietary polyphenols and TBARS plasma concentration. Given that only one half of the subjects had an available outcome measure, the statistical analysis had to handle missingness in the outcome, in addition to missingness in the potential confounders collected through self-reported questionnaires at inclusion.

Statistical analysis

Qualitative parameters were expressed as numbers (percentages) and quantitative parameters as median (interquartile range [IQR]). We compared categorical variables using Fisher’s exact or chi-square tests, as appropriate, and continuous variables using the Kruskal-Wallis test.

We used the multiple imputation by chained equations (MICE) algorithm to impute the missing data. The key concept of this sequential method is to use the distribution of the observed data to estimate a set of plausible values for the missing data, incorporating random components to reflect the uncertainty of these estimates. Multiple data sets are created, analyzed individually but identically to obtain a set of parameter estimates, and then combined to obtain the overall estimates, variances and confidence intervals, according to Rubin’s rules [26]. One imputation model has to be defined for each variable with missing values [11].
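The pooling step can be sketched in a few lines. The following is a minimal illustration of Rubin's rules for a single regression coefficient, not the code used in this study (which relied on the R package mice); the function name `pool_rubin` and the five estimates and variances are hypothetical.

```python
import math

def pool_rubin(estimates, variances):
    """Pool one scalar parameter across m imputed data sets (Rubin's rules)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = sum(estimates) / m                               # pooled point estimate
    within = sum(variances) / m                              # mean within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total = within + (1 + 1 / m) * between                   # total variance
    return {"estimate": q_bar, "within": within, "between": between,
            "total": total, "se": math.sqrt(total)}

# Hypothetical coefficients and squared standard errors from m = 5 imputed data sets
est = [0.10, 0.12, 0.08, 0.11, 0.09]
var = [0.0004, 0.0005, 0.0004, 0.0006, 0.0005]
pooled = pool_rubin(est, var)
print(pooled)
```

The total variance adds the between-imputation component to the average within-imputation variance, which is how the uncertainty due to imputation enters the pooled standard error. Confidence intervals additionally require a reference t distribution with adjusted degrees of freedom, which this sketch does not compute.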

All potential confounders, whether selected by univariate analyses at the 0.2 level (a threshold chosen to ensure that most potentially prognostic variables were selected) or identified from the literature (namely age, intervention group, smoking status at inclusion, number of dietary records, vitamin C and β-carotene), were included in the imputation models [11], together with auxiliary variables (year of inclusion, physical activity at inclusion, educational level, smoking status) [11, 27] and the outcome (TBARS) [11, 12, 15].

We used the predictive mean matching algorithm (pmm) for quantitative variables, logistic regression (logreg) for binary variables, and polytomous logistic regression (polyreg) for categorical variables [28]. All multiple imputations used 50 imputed data sets.
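To give an intuition of predictive mean matching, the sketch below imputes a quantitative variable from a single predictor: each missing value is replaced by the observed value of a donor drawn at random among the k cases with the closest predicted means. This is a deliberately simplified illustration on hypothetical data, and the helper names `fit_line` and `pmm_impute` are assumptions. Full implementations typically also draw the regression parameters at random so that the imputations reflect parameter uncertainty; that step is omitted here.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

def pmm_impute(x, y, k=5, rng=None):
    """Return a copy of y in which each None is replaced by an observed donor value."""
    rng = rng or random.Random()
    obs = [i for i, yi in enumerate(y) if yi is not None]
    a, b = fit_line([x[i] for i in obs], [y[i] for i in obs])
    pred = [a + b * xi for xi in x]
    completed = list(y)
    for i, yi in enumerate(y):
        if yi is None:
            # k observed cases whose predicted mean is closest to that of case i
            donors = sorted(obs, key=lambda j: abs(pred[j] - pred[i]))[:k]
            completed[i] = y[rng.choice(donors)]
    return completed

# Hypothetical data: x fully observed, y missing for two subjects
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, None, 6.2, 7.9, None, 12.1, 13.8, 16.0]
filled = pmm_impute(x, y, k=3, rng=random.Random(0))
print(filled)
```

Because imputed values are always borrowed from observed cases, they stay within the observed range of the variable, which is one reason this method is often preferred for skewed quantitative variables.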

Associations between each subclass of dietary polyphenols and the plasma TBARS measure were evaluated using univariate and multivariate linear regression models. Because the relationship between polyphenol subclasses and TBARS was not linear, each subclass was divided into quartiles computed on the whole population, so that the relationship could be assessed without assuming linearity. Potential confounders were selected by univariate analyses at the 0.2 level or identified from the literature, namely age, sex, intervention group, smoking status at inclusion, number of dietary records, energy intake, and alcohol intake at inclusion. A backward selection procedure was used, with these seven confounders forced into the final model.
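The quartile coding described above can be sketched as follows. This is an illustration on hypothetical intake values, not the study code: the function names, the interpolation rule used for the cut points (`method="inclusive"`) and the handling of values equal to a cut point are assumptions, since the text does not specify them.

```python
import statistics

def quartile_classes(values):
    """Code each value 0..3 (<Q1, Q1-Q2, Q2-Q3, >Q3) from sample-wide cut points."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    # Count how many cut points the value exceeds (ties go to the lower class)
    return [sum(v > cut for cut in (q1, q2, q3)) for v in values]

def dummy_code(classes, reference=0, n_levels=4):
    """One indicator column per non-reference class, for use in a linear model."""
    levels = [k for k in range(n_levels) if k != reference]
    return [[int(c == k) for k in levels] for c in classes]

intake = [5, 12, 18, 25, 31, 40, 47, 58]  # hypothetical intakes (mg/day)
classes = quartile_classes(intake)
design = dummy_code(classes)              # lowest class (<Q1) is the reference
print(classes)
print(design)
```

Each non-reference class gets its own regression coefficient, so the estimated effect is free to rise or fall from one quartile to the next instead of following a straight line.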

We assessed and compared the selected predictors when dealing with various samples. First, we considered the subsample of the 2,116 subjects with available TBARS, with either no missing confounders (complete case analysis: n = 1,112) or with imputed confounders. Then, we considered the whole sample of the 4,129 enrolled subjects of SU.VI.MAX2, handling the missing values in the outcome itself to reach valid inferences on the whole population.

All statistical tests were two-sided, with P-values of 0.05 or less denoting statistical significance. Statistical analysis was performed on R 3.2.0 (http://www.R-project.org/) and multiple imputation analysis used the R package “mice” [29].

Results

Figure 1 displays the study flow chart. Among the 4,129 included subjects, 2,013 (48.8%) had a missing outcome, including 734 who also had at least one missing covariate (Table 1). The comparison of subjects with and without an outcome measure is reported in Table 2. Although the selection for TBARS measurement was based on a geographic criterion, availability of the TBARS measure was related, at the p = 0.10 level, to participants’ characteristics such as gender, age, selenium, and serum concentrations of α-tocopherol and retinol. This suggests that the underlying missingness mechanism was not missing completely at random (MCAR) but possibly missing at random (MAR).

Fig. 1 Flowchart of the study. (a) Covariates considered are those included in the analysis model (but not the “auxiliary” covariates used only in the imputation model)

Table 1 Number of missing covariate values among subjects with and without a missing outcome
Table 2 Baseline characteristics of subjects according to availability of TBARS measure and the 15 selected predictors

Univariable analyses

First, based on complete case analysis (n = 1,112), there was no evidence of any association between subclasses of dietary polyphenols and TBARS (Fig. 2). These associations were then studied after multiple imputation, either excluding or including individuals with a missing outcome.

Fig. 2 Comparison of the estimated regression coefficients (with 95% CI) of the polyphenol subclasses (namely flavonols, stilbenes, proanthocyanidins, dihydroflavonols, catechins, hydroxybenzoic acids, theaflavins, flavones) that had univariable prognostic value on TBARS in at least one sample, according to the handling of missing values. Each subclass was defined as a 4-class variable according to its quartiles, namely below the first quartile (Q1), between Q1 and the median (Q2), between Q2 and the third quartile (Q3), and above Q3; the first class (<Q1) was used as the reference. Abbreviations: CCA: complete case analysis; subjects with TBARS: multiple imputation with chained equations of missing covariates (n = 2116); all subjects: multiple imputation of missing covariates and outcome (n = 4129)

While no significant relationship between total dietary polyphenols and TBARS was found, eight subclasses of polyphenols were selected by the MICE univariate analyses as associated with TBARS at the 0.20 level, whatever the sample, namely proanthocyanidins, dihydroflavonols, theaflavins, flavones, catechins, flavonols, hydroxybenzoic acids and stilbenes (Fig. 2). The standard errors of the regression coefficients, and thus the widths of the 95% confidence intervals, were larger for the complete case analysis (CCA) than for the MICE estimates, which were close whether or not subjects without the outcome were included (Fig. 2). Besides these polyphenols, nine other variables were selected as carrying prognostic information, namely weight and height, gender, retinol levels, energy intake, alcohol intake, α-tocopherol, selenium and zinc. We also incorporated the six other prognostic covariates identified from the literature, namely age, intervention group, smoking status at inclusion, number of dietary records, vitamin C and β-carotene. Thus, a total of 23 covariates were finally included in the multivariate model: the 8 polyphenol subclasses listed above as exposure variables, and these 15 confounding factors.

Multivariable analyses

Table 3 compares the multivariate models based either on complete cases or on MICE. When subjects with a missing outcome were excluded, only two polyphenol subclasses, namely catechins and hydroxybenzoic acids, were selected as associated with the outcome, whereas only catechins retained a predictive value after the outcome itself was imputed. Of note, the higher the catechins, the higher the expected TBARS (p = 0.0033), whereas the association between hydroxybenzoic acids and TBARS was negative: the higher the hydroxybenzoic acids, the lower the expected TBARS (p = 0.027). Finally, as compared to the complete case analysis, two other variables were selected by the multiple imputation procedures only, namely energy intake and β-carotene level, while in contrast the predictive value of age disappeared; these findings were similar whether or not the outcome was imputed.

Table 3 Multivariate model of the association of polyphenol intakes with TBARS according to the method of handling missing values

Discussion

Missing data are a common burden in epidemiological studies. When faced with such missing data, the assumptions about their underlying mechanisms need to be considered cautiously to avoid bias and/or inefficiency in the estimated effect of any exposure on the outcome. Accordingly, Little and Rubin’s classification of missing data is widely used to distinguish three main types of missingness mechanisms [30]. When the MCAR assumption is violated but the MAR assumption can hold, multiple imputation has been shown to be a valid and simple approach to deal with missing covariates and even missing outcomes [11]. However, it is still seldom used and reported in epidemiological settings. Indeed, in addition to the unavoidable delay between any statistical innovation and its practical use, multiple imputation techniques require a good understanding to be applied correctly, and their implementation is not straightforward. Finally, complete case analysis (i.e. excluding all subjects with a missing value on at least one covariate) is the default method in most statistical software, which makes it easier to use and explains why it is so widely applied in epidemiologic studies. We thus attempted to illustrate the use of multiple imputation, and the importance of being explicit about the imputation procedure, when assessing the impact of polyphenols on oxidative stress, a subject of major concern for epidemiologists. Indeed, an adverse role of oxidative stress has previously been suggested in brain injury [31], carcinogenesis [32] and cardiovascular diseases [33], and a protective role of polyphenols has been shown in cancers [34] and dementia [35].

We used data from SU.VI.MAX 2, a large cohort study conducted in the general population. The outcome was the TBARS concentration, which has also recently been used as an endpoint in clinical trials of selenium supplementation in patients with type 2 diabetes [36] and of exercise training in hemodialysis patients [37]. The association between catechins and biomarkers of oxidation measured with TBARS has been the focus of many studies, particularly because of the abundance of catechins in the human diet. The impact of catechins on TBARS has mostly been evaluated in animal studies, while studies in humans were based on small samples, with conflicting results [38]. Studies on the relationship between catechins and oxidative stress are inconsistent [39], possibly because the amount of catechins ingested by subjects differs according to the study design. For example, Tinahones et al., in a study of 14 healthy women, showed that the consumption of green tea extract for five weeks was associated with a significant 37.4% reduction in the concentration of oxidized LDL (TBARS) (p = 0.017) [40]. Nantz et al. [41] showed, in a study of 111 healthy volunteers, that after 3 weeks of taking Camellia sinensis compounds twice a day, serum malondialdehyde levels were 11.9% lower than baseline levels in the intervention group, as compared with the placebo group. On the contrary, Gomikawa et al. showed that plasma TBARS contents were unchanged after green tea consumption [42]. In our study, the relationships between catechins or hydroxybenzoic acids and TBARS were not linear, in either univariate or multivariate analyses, whatever the imputation model.
Based on multivariate models from imputed datasets, high hydroxybenzoic acid intakes (above the third quartile, Q3) were negatively associated with TBARS, while catechin intakes higher than Q3 were positively associated with TBARS; thus, while the former subclass of polyphenols appears antioxidant, the latter appears to increase oxidative stress. This result is quite unexpected, since polyphenols have usually been reported as antioxidants. However, most of these reports came from in vitro experiments or in vivo animal models using polyphenols at pharmacological doses, while the current study examines the role of polyphenols at nutritional doses in humans. Moreover, at high doses, polyphenols have also been shown to exert pro-oxidative effects (e.g., increased expression of phosphorylated histone 2AX and metallothionein, markers of DNA damage and response to oxidative stress, respectively). These pro-oxidative activities may be involved in the hepatic and gastrointestinal toxicities observed in animals and humans [43, 44]. Nevertheless, the shape of the relationship between catechins and TBARS should be stressed. These results call for further exploration of the association between polyphenol intake and TBARS in humans.

Of note, in the complete case analysis, none of the polyphenol subclasses was found to be associated with TBARS. Moreover, the strength of association, as measured by the estimated regression coefficients, was affected by the approach used to handle missing values: for instance, coefficients were halved for the randomized supplementation group and the number of dietary records, while they doubled for others such as energy intake. Lastly, some participants’ characteristics, such as sex, age, alcohol intake, zinc or selenium, differed according to the availability of a TBARS measure, suggesting that the probability of data being missing may depend on the observed data (that is, MAR). This mechanism rules out complete case analysis as a basis for conclusive findings. However, it is not possible to distinguish between MAR and MNAR (missing not at random, i.e. when the probability of missing data depends on unobserved data) from the observed data alone, although the MAR assumption may appear more plausible in this series given the large number of explanatory variables included in the analysis. When MNAR appears likely, other statistical approaches should be used [45–47].
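The consequence of a MAR mechanism for a complete case summary can be illustrated with a small simulation. The sketch below uses purely hypothetical data (not the SU.VI.MAX data): an outcome y is observed more often when a fully observed covariate x is high, so that the complete case mean of y is biased, whereas a mean that uses x to predict the missing outcomes is not. For brevity it uses a single regression-based prediction rather than multiple imputation, and all variable names and parameter values are assumptions.

```python
import random

rng = random.Random(42)
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]   # fully observed covariate
y = [xi + rng.gauss(0, 1) for xi in x]    # outcome, true mean 0

# MAR: the chance of observing y depends only on the always-observed x
observed = [rng.random() < (0.8 if xi > 0 else 0.2) for xi in x]

full_mean = sum(y) / n                    # unavailable in practice
xo = [xi for xi, o in zip(x, observed) if o]
yo = [yi for yi, o in zip(y, observed) if o]
cc_mean = sum(yo) / len(yo)               # complete case mean

# Regress y on x among observed cases, then predict y for every subject
mx, my = sum(xo) / len(xo), cc_mean
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(xo, yo))
         / sum((xi - mx) ** 2 for xi in xo))
intercept = my - slope * mx
adjusted_mean = sum(intercept + slope * xi for xi in x) / n

print(round(full_mean, 3), round(cc_mean, 3), round(adjusted_mean, 3))
```

In this setting the complete case mean overestimates the true mean because subjects with high x are over-represented among those with an observed outcome, while conditioning on x removes most of that bias. The same correction would not work under MNAR, where missingness depends on the unobserved y itself.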

According to previous recommendations, the imputation models included all variables planned for the analysis model, including the outcome [11]. While including the outcome in the imputation model has been recommended, this point is somewhat delicate: as underlined by Moons et al. [48], using the outcome to impute covariates whose association with that outcome is under study could seem, on the contrary, a self-fulfilling prophecy. As recommended, the TBARS outcome was systematically included in the imputation models, whatever the sample of interest. Whatever the imputation model, the same predictors (except hydroxybenzoic acids) were selected with close estimated effects, although differences in subject characteristics suggested that the geographic criterion had resulted in a selected population (Table 2). The close estimates of the two models may thus also be explained by the fact that these models were adjusted for those discrepancies. Moreover, hydroxybenzoic acids were not selected when the imputation also applied to the outcome. The high proportion of missing outcomes may have influenced these results, since about 50% of the outcomes had to be imputed. However, Moons et al. [48] showed that imputation of such a high proportion of missing values still provided less biased results than complete case analysis. Further studies are thus necessary to confirm or refute this observation and the shape of the relationship.

Our study has some limitations. First, dietary intakes were measured at baseline whereas the outcome was measured about 12 years later. It may thus be difficult to detect associations between polyphenols and TBARS, because changes in dietary intakes over time may have confounded the estimation. Moreover, we used the TBARS measure to explore the relationship between polyphenol subclasses and oxidative stress, as it was the only oxidative stress biomarker collected in SU.VI.MAX2. Measuring oxidative stress is complicated, and it is difficult to state which oxidative stress biomarker is the best. TBARS are an interesting marker of oxidative stress since they have been hypothesized to represent a composite of oxidative damage products [49]; however, the assay of serum or urine isoprostanes is now frequently used as a gold standard. Further research on this association using an oxidative stress biomarker other than TBARS would be of great interest to confirm our results. Besides, the assessment of polyphenol intakes through dietary records is subject to self-reporting bias, even though repeated 24-h dietary records constitute an accurate and efficient measurement of polyphenol intakes [50]. However, handling missing data with multiple imputation limits the selection bias issue, because subjects are analyzed regardless of whether their covariates are available. Multiple imputation by chained equations also has some limitations, among which are its lack of theoretical justification, a sensitivity to the MAR assumption that increases with the amount of missing data, and possible non-convergence, since it is an iterative procedure [11]. Nevertheless, compared with other methods (CCA, median imputation, etc.), multiple imputation by chained equations remains a less biased way to handle missing values.
Lastly, regarding multiplicity, some false positive results could have occurred.

Conclusions

In summary, we have provided some results on the association between the intake of dietary polyphenol subclasses and a validated biomarker of oxidative stress, taking into account missing data in both the covariates and the outcome. The likely missing at random mechanism allowed the use of a multiple imputation approach, which suggested that, among the 17 subclasses of polyphenols, only catechins had a predictive value for oxidative stress.