Serum metabolites associated with wholegrain consumption using nontargeted metabolic profiling: a discovery and reproducibility study

Purpose: To identify fasting serum metabolites associated with wholegrain (WG) intake in a free-living population, adjusted for potential confounders.
Methods: We selected fasting serum samples at baseline from a subset (n = 364) of the prospective population-based Kuopio Ischaemic Heart Disease Risk Factor Study (KIHD) cohort. The samples were analyzed using nontargeted metabolomics with liquid chromatography coupled with mass spectrometry (LC–MS). Association with WG intake was investigated using both random forest followed by linear regression adjusted for age, BMI, smoking, physical activity, energy and alcohol consumption, and partial Spearman correlation adjusted for the same covariates. Features selected by any of these models were shortlisted for annotation. We then checked whether the findings could be replicated in an independent subset from the same cohort (n = 200).
Results: Direct associations were observed between WG intake and pipecolic acid betaine, tetradecanedioic acid, four glucuronidated alkylresorcinols (ARs), and an unknown metabolite in both the discovery and replication cohorts. The associations remained significant (FDR < 0.05) after adjustment for the confounders in both cohorts. Sinapyl alcohol was positively correlated with WG intake in both cohorts after adjustment for the confounders but was not associated in linear models in the replication cohort. Some microbial metabolites, such as indolepropionic acid, were positively correlated with WG intake in the discovery cohort, but the correlations were not replicated in the replication cohort.
Conclusions: The identified associations between WG intake and the seven metabolites after adjusting for confounders in both discovery and replication cohorts suggest the potential of these metabolites as robust biomarkers of WG consumption.
Supplementary Information: The online version contains supplementary material available at 10.1007/s00394-022-03010-x.


Introduction
Consumption of wholegrain (WG) cereals has been shown to convey various health benefits, such as lower inflammation markers [1] as well as reduced risk of type 2 diabetes [2], cardiovascular diseases, and colorectal and prostate cancer [3]. Fiber and phytochemical content have been suggested as the key components responsible for health benefits via modulation of, e.g., postprandial glycemic response and lowering serum LDL cholesterol [4,5]. In addition, the fiber content could potentially influence the gut microbial community [6], which may induce changes in the microbial metabolites and metabolic outcomes thereafter. To advance our understanding of the mechanisms by which WG influence health outcomes, dietary assessment is crucial. However, subjective reporting of dietary intake is prone to misreporting due to, e.g., recall bias, error in estimation of portion size, or giving favorable or socially desirable answers. The application of both subjective reporting and objective measurement of biomarkers can provide complementary estimation of dietary intake, which may not be achievable using only either one of the approaches.
The FoodBall consortium has classified dietary biomarkers as indicators to reflect (1) the consumption of food, its compounds or components, or part of a dietary pattern, or (2) the effect or implicated physiological and health status [7]. In the case of WG, odd-chain alkylresorcinols (ARs) and their homologues have been widely explored as intake biomarkers of WG rye and wheat, while the even-chain ones seem to be specific for quinoa [5,8,9]. More recently, trimethylamine-N-oxide and various betainized compounds have been reported from consuming a WG-rich diet [10,11]. In addition, lower levels of several endogenous compounds, such as serotonin, taurine, glycerophosphocholine, and phosphatidylcholines (PCs), have also been reported after WG intake [12,13]. However, the metabolism of these compounds in the body seems to depend on individual factors, such as age, sex, and BMI [14,15]. In addition, many factors covary with habitual WG intakes, such as higher physical activity, lower tendency to smoke, and lower alcohol consumption [16]. On top of that, the risk of non- or low compliance in intervention studies [17] may make it more complicated to disentangle the effect of individual factors on WG-associated metabolites. Hence, there is a need to establish a panel of diet-derived and/or endogenous metabolites associated with WG intake independent of confounding factors in free-living populations.
Applications of nontargeted metabolomics in health sciences have been shown to reflect the contribution of intrinsic and (semi-) modifiable factors, including genetics [18], endogenous metabolic pathways, and gut microbiota [19], as well as lifestyle factors, such as diet [20], stress [21], and other environmental exposures [22]. Profiling the blood metabolome may hence provide information about lifestyle, environmental exposure, and other information about the individuals, including biological mechanisms underlying the relationship between nutrition and health [22-24].
Here we present the application of nontargeted metabolic profiling to assess blood metabolites associated with WG consumption in a prospective population-based cohort study. Based on the presumed causal relationship between WG intake and the blood metabolome, associations were adjusted for confounders (age, BMI, smoking, physical activity, energy and alcohol consumption). Finally, we checked whether the discovered metabolites could be replicated in an independent subset.

Study population
The samples for this study were obtained from the Finnish middle-aged male participants of the Kuopio Ischaemic Heart Disease Risk Factor Study (KIHD). KIHD is an ongoing population-based prospective cohort study in Eastern Finland [25]. The baseline examinations took place in 1984-1989. A total of 2682 men aged 42-60 years (83% of those who were eligible) participated in the baseline examinations.

Dietary assessment
Participants self-reported their dietary intake at baseline using a 4-day food record [26]. To ensure reporting accuracy, the participants received instructions on how to fill out the food record and a picture book containing a list of 126 foods and drinks typically consumed in Finland during the 1980s. Each item included a corresponding estimation of portion size based on household measures to ensure proper assessment and recording [27]. During a study visit, a nutritionist checked the completed food records with the participant to improve accuracy [25].
The definition of WG followed the definition by the HEALTHGRAIN project [28], including downstream products, such as pasta. The KIHD database does not include information on intakes of individual grains. In the mid-to-late 1980s in Finland, wheat and rye were the most commonly consumed grains, followed by oat, rice, and barley [29]. However, in the KIHD cohort, WG pasta or rice intake was very uncommon (Table 1). The calculation of food and nutrient intakes was performed using the NUTRICA® 2.5 software (Social Insurance Institution, Turku, Finland), based mainly on the Finnish database of the nutrient composition of foods.

Selection of samples
Serum samples and data for this study were taken from two independent subsets within the KIHD cohort. The discovery cohort (DC) was selected from a previous study on adherence to a healthy Nordic diet and incidence of coronary artery disease within a mean follow-up of 20.4 years (n_DC = 364) [30]. The replication cohort (RC) was taken from a study investigating the association between egg consumption and the incidence of type 2 diabetes after a mean follow-up of 19.3 years [31]. From the original number of participants (n = 239), 39 participants were excluded, since they were already included in the DC (n_RC = 200).

Table 1 Baseline characteristics and dietary intake of study participants in each subset. All values are presented in median ± interquartile range (IQR), except for proportion of current and past smokers. Dietary data are presented in 4-day-food-record median ± interquartile range (IQR). %E percentage of energy intake, SFA saturated fatty acids, MUFA monounsaturated fatty acids, PUFA polyunsaturated fatty acids. DC discovery cohort [30], RC replication cohort [31]

Collection of blood samples and other measurements
Blood samples were collected during the baseline examination visits in 1984-1989. Participants were instructed to abstain from alcohol consumption for 3 days and from smoking and eating for 12 h before examination visits between 08.00 and 10.00 on Tuesdays-Thursdays [32]. After 30-min rest in supine position, venous blood samples were drawn without a tourniquet [32]. Serum was separated by centrifugation at 2000g for 10 min (20 °C) after coagulation at room temperature for an hour [32]. The obtained serum samples were stored at −80 °C until LC-MS analysis in 2016 for RC and 2018 for DC. Body mass index (BMI) was calculated as body weight (in kg) divided by the square of height (in m²). The recording of habitual leisure-time physical activity [33], smoking and alcohol consumption in the past 12 months and measurement of blood pressure [34] have been described previously.

Metabolomics analysis
Sample randomization and preparation steps have been described in previous publications [30,31]. After the samples were thawed entirely on ice water for approximately 3 h, 100 µL of each sample was mixed with 400 µL of acetonitrile and then pipetted into a 96-well filter plate layered on top of a 96-well plate. Centrifugation (700g, 4 °C, 5 min) was performed to obtain a protein-free filtrate [35], which was directly used for LC-MS injection.
Data acquisition for nontargeted metabolic profiling analysis was performed at the LC-MS metabolomics center (Biocenter Kuopio, University of Eastern Finland). Two different LC-MS systems were employed for the DC and RC [30,31]. The LC systems for the DC and RC were Vanquish UHPLC (Thermo Fisher Scientific) and 1290 Infinity Binary UPLC (Agilent Technologies), respectively. Both systems utilized two chromatographic techniques: reversed-phase (RP) chromatography (Zorbax Eclipse XDB C18, 2.1 × 100 mm, 1.8 μm, Agilent Technologies, Palo Alto, CA, USA) and hydrophilic interaction chromatography (HILIC) (Acquity UPLC® BEH Amide 1.7 µm, 2.1 × 100 mm, Waters Corporation, Milford, MA, USA). The injection volume was 1 µL for each sample. A pooled sample from all biological samples per experiment was injected at the beginning and after every 12 samples throughout the LC-MS run for quality control and drift correction.
The MS systems were a Q Exactive Focus Orbitrap MS (Thermo Fisher Scientific) for DC and an Agilent 6540 Q-TOF (Agilent Technologies) for RC [30,31], both with high resolution and accuracy. The data were acquired in both positive (ESI+) and negative (ESI−) electrospray ionization modes. At the end of the analysis, data-dependent MS2 spectra were acquired for each mode. Further information about the LC-MS instrument setup and data acquisition parameters can be obtained from the previous publications [30,31].

Discovery cohort
Peak-picking was performed using MS-DIAL version 4.20 [36] after converting the raw files to .abf format using Abf Converter. The data were collected with a tolerance of 0.01 Da for MS1 and 0.025 Da for MS2. Peak detection was performed with a minimum peak height of 10,000 for DC and 1000 for RC due to the different detection units. Preliminary identification was performed in MS-DIAL [36] against the uploaded in-house library with an identification score cutoff of 70% and accurate mass tolerance of 0.015 Da for MS1 and 0.05 Da for MS2. The tolerance for peak alignment was 0.015 Da and 0.15 min. After alignment, the raw peak area from each mode was exported to .xlsx files. This data matrix contained 36,584 features from RP−, 30,607 from RP+, 25,871 from HILIC−, and 15,095 from HILIC+, which then underwent data preprocessing.
All features were preprocessed using the R package notame (https://github.com/antonvsdata/notame) as previously described [21,35]. The procedures allow correction of drift due to long LC-MS run sequence, missing values imputation, and removal of low-quality signals [35]. Following this procedure, we retained 2829 and 1438 features from HILIC, and 6260 and 6957 features from RP, in ESI+ and ESI−, respectively. Thus, the combined data matrix comprised 17,484 features from 364 participants in DC. Before statistical analyses, the peak areas of the features were transformed using log-transformation, followed by normalization by mean-centering and scaling to unit variance.
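The final transformation step can be illustrated with a minimal Python sketch for a single feature. This is not the notame implementation; the function name `log_autoscale` and the use of the natural logarithm are assumptions made here for illustration. The base of the logarithm does not change the result, because a change of base is a constant factor that cancels when scaling to unit variance.

```python
import math
from statistics import mean, stdev


def log_autoscale(peak_areas):
    """Log-transform one feature, then mean-center and scale to unit variance.

    Assumes strictly positive peak areas (imputation already done) and uses
    the sample standard deviation (n - 1 denominator).
    """
    logged = [math.log(area) for area in peak_areas]
    center = mean(logged)
    spread = stdev(logged)
    return [(value - center) / spread for value in logged]


# Hypothetical peak areas of one feature in three samples
scaled = log_autoscale([100.0, 1000.0, 10000.0])
```

Because the three hypothetical peak areas are equally spaced on the log scale, the scaled values are symmetric around zero.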

Replication cohort
The metabolomics data of the RC underwent a similar preprocessing procedure as DC described above. One data file from RP+ was corrupted during the peak-picking procedure, so the feature alignment of RP+ was based on 199 samples. The removal of low-quality features yielded 14,110 features from 200 participants in RC, which underwent the same normalization procedures as in DC.

Discovery cohort
The selection of features for the identification step employed both multivariate and univariate approaches. Random Forest (RF) using the R package MUVR (https://gitlab.com/CarlBrunius/MUVR), which incorporates a repeated double cross-validation scheme, was applied to unbiasedly select a set of molecular features ranked based on their importance to predict the total WG intake. Permutation tests (n = 40, p for the difference between actual and permutation models = 1.21e−14) were performed to ascertain that modeling results were not due to overfitting [37]. This variable selection procedure maximized the selection of all relevant features (max model), resulting in a selection of 130 metabolic features. These features were then fitted to a linear regression model (using the built-in lm function in R) with WG intake as the independent variable and the normalized metabolite feature as the dependent variable, followed by correction for multiple testing by false discovery rate (FDR). FDR < 0.05 was considered significant.
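The multiple-testing correction can be sketched as follows. The text does not name the FDR procedure, so this Python example assumes the widely used Benjamini-Hochberg step-up method; the function name and the p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: the FDR correction is Benjamini-Hochberg; the source text
    only states "false discovery rate (FDR)".
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical p-values from four per-feature linear models
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q < 0.05 for q in fdr]
```

Each feature is then called significant when its adjusted value falls below the chosen cutoff (0.05 for the linear models).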
In addition to the feature selection using random forest, we also performed a partial Spearman correlation test to capture additional features that may not be selected by RF. The correlation test was performed between WG intake and peak area of all features after first regressing both WG intake and peak areas on the confounders (age, BMI, leisure-time physical activity, smoking, and intake of alcohol and energy) using the built-in lm function. Residuals were then correlated using the built-in cor.test function in R. The cutoff of FDR < 0.005 was used to limit the annotation and discussion to a reasonable shortlist of likely relevant metabolites.

Replication cohort
Annotated metabolites in the DC (Supplementary Table 1) were checked if they were also detected in the RC. To estimate the RT of those features in RC, 46 metabolites with confirmed identity based on the mass-to-charge ratio (m/z), retention time (RT), and MS2 spectra from both DC and RC were fitted to a locally estimated scatterplot smoothing (LOESS) (Supplementary Table 2) using the built-in loess function in R. This set included some metabolites that eluted in RT ranges not covered by the relevant features and served as anchor points, although they were outside the scope of interest of the current study (Supplementary Table 2). The fitted LOESS was then used to predict (using the built-in predict function in R) the RT in the RC of the shortlisted features from DC without MS2 spectra.
Features within an m/z tolerance of 5 ppm and an RT tolerance of 0.5 min from either the RT in the DC or the LOESS-predicted RT were added to the list of validated metabolites. In total, 61 metabolites meeting these tolerances (Supplementary Table 2, Supplementary Methods) were found in the RC. Random forest was not applied to the RC, because RF did not seem to fit the current subset (Q² = 0.03). The reason could be the selection criteria of the study population, which were based on egg intake [31] and were not related to WG intake. Hence, these metabolites were then subjected to the same Spearman correlation and linear regression models as in the DC (Supplementary Table 3, Supplementary Methods).
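The matching rule can be written as a small Python sketch. The function names, argument names, and numeric values below are hypothetical; only the 5 ppm and 0.5 min tolerances and the rule that the RT may match either the DC value or the LOESS-predicted value come from the text. The LOESS fit itself is not reproduced here.

```python
def ppm_error(mz_observed, mz_reference):
    """Relative mass difference in parts per million."""
    return (mz_observed - mz_reference) / mz_reference * 1e6


def is_replicated(mz_dc, rt_dc, mz_rc, rt_rc, rt_predicted=None,
                  ppm_tol=5.0, rt_tol=0.5):
    """True if an RC feature matches a DC feature within both tolerances.

    rt_predicted is an optional LOESS-predicted RT for the RC platform;
    the RT criterion is met if the RC feature is close to either reference.
    """
    mz_ok = abs(ppm_error(mz_rc, mz_dc)) <= ppm_tol
    rt_references = [rt_dc] if rt_predicted is None else [rt_dc, rt_predicted]
    rt_ok = any(abs(rt_rc - reference) <= rt_tol for reference in rt_references)
    return mz_ok and rt_ok


# Hypothetical feature: 4 ppm mass difference, RT shifted between platforms
match = is_replicated(mz_dc=200.0000, rt_dc=3.0, mz_rc=200.0008, rt_rc=3.8,
                      rt_predicted=3.6)
```

In this hypothetical case the feature would be missed on the DC retention time alone (0.8 min difference) but is accepted through the predicted RT, which is the situation the RT prediction is meant to handle.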

Adjustment for potential confounders
Based on presumed causal relationships depicted in a directed acyclic graph [38] (Supplementary Fig. 1), variables associated with both WG intake as exposure and blood metabolome as outcome were identified as potential confounders. These selected confounders were age, BMI, leisure-time physical activity (kcal/day), smoking (estimated as cigarette packs per day multiplied by years of smoking), and intake of alcohol (g/week) and energy (kcal/day). In particular, energy intake was included as in the standard multivariate model [39]. These confounders were adjusted for in partial Spearman correlations between WG intake and metabolic features and in adjusted linear models in DC. Both were followed by FDR adjustment. FDR < 0.005 for correlation analysis and FDR < 0.05 for the linear models were considered significant.
The same set of confounders was also adjusted for in the Spearman correlation and linear regression model in the RC, except for smoking, since only one RC participant smoked. FDR < 0.05 for either correlation or linear models was considered significant in the RC. All statistical analyses were performed using R version 4.0.3 [40].
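The residual-based partial Spearman correlation described in the statistical analysis can be sketched in Python as follows. The analysis itself used the R functions lm and cor.test; the helper names and the toy data here are hypothetical, and the small least-squares solver is written out only to keep the example free of external dependencies.

```python
import math


def ols_residuals(y, covariates):
    """Residuals of y after least-squares regression on covariates plus intercept."""
    n = len(y)
    design = [[1.0] + list(row) for row in covariates]
    p = len(design[0])
    # Augmented normal equations [X'X | X'y]
    aug = [[sum(design[k][i] * design[k][j] for k in range(n)) for j in range(p)]
           + [sum(design[k][i] * y[k] for k in range(n))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    beta = [aug[i][p] / aug[i][i] for i in range(p)]
    return [y[k] - sum(b * x for b, x in zip(beta, design[k])) for k in range(n)]


def ranks(values):
    """Ranks starting at 1, with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            result[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return result


def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def partial_spearman(exposure, feature, covariates):
    """Spearman correlation between the covariate-adjusted residuals."""
    res_exposure = ols_residuals(exposure, covariates)
    res_feature = ols_residuals(feature, covariates)
    return pearson(ranks(res_exposure), ranks(res_feature))


# Toy data: WG intake, one feature, and a single covariate per participant
wg = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0]
covs = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
rho = partial_spearman(wg, [2.0 * v for v in wg], covs)
```

Regressing both variables on the covariates before ranking removes the part of each that is linearly explained by the confounders, so the remaining rank correlation reflects only the variation they do not share with those covariates.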

Compound annotation
Features in the DC with FDR < 0.05 in linear modeling (n = 112) or FDR < 0.005 in correlation analysis (n = 245) were added to the shortlist for compound annotation (Fig. 1). The list was further narrowed down by limiting molecular mass < 1000 Da, RT 1-12 min for HILIC and 1-15.5 min for RP modes, leaving 270 features for annotation.
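The narrowing criteria can be expressed as a simple filter. This Python sketch uses hypothetical names and treats the RT window bounds as inclusive, which the text does not specify.

```python
# RT windows in minutes per chromatographic mode, as stated in the text
RT_WINDOWS = {"HILIC": (1.0, 12.0), "RP": (1.0, 15.5)}


def keep_for_annotation(mass_da, rt_min, mode, max_mass_da=1000.0):
    """True if a shortlisted feature passes the mass and RT criteria.

    Assumption: RT bounds are inclusive; the mass limit is strict (< 1000 Da).
    """
    low, high = RT_WINDOWS[mode]
    return mass_da < max_mass_da and low <= rt_min <= high


# Hypothetical features as (molecular mass in Da, RT in min, mode)
features = [(450.2, 5.0, "HILIC"), (450.2, 13.0, "HILIC"), (1200.0, 5.0, "RP")]
shortlist = [f for f in features if keep_for_annotation(*f)]
```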
Extracted ion chromatograms and MS2 spectra of differential metabolites were visualized using Freestyle 1.3 (Thermo Fisher Scientific) for annotation purposes. Metabolite annotation was performed based on matching mass, isotopic pattern, and MS2 spectra against existing libraries, either in-house for level I (together with matched RT with pure commercial compound run in the same platform) or online spectral databases (Supplementary Methods) for level II according to the guidelines from the Metabolomics Standards Initiative [41]. The utilized reference libraries for level II identification were MassBank [42,43], METLIN [44], HMDB version 4.0 [45], and MassBank of North America (MoNA). Lipophilic compounds were matched against the in-house library or built-in MS-DIAL library [36] and LIPID MAPS [46]. Phospholipids [47,48], dihydroxybenzoic acid [49,50], betaines [51,52], and alkylresorcinols [12,49] were annotated based on previously reported MS2 fragments. Features without data-dependent MS2 were subjected to targeted MS2 analysis using the previously described method [30]. Metabolites with compound class annotation based on the fragmentation patterns were reported as level III. Completely unknown compounds with unavailable MS2 data or lacking MS2 interpretation were reported as level IV [41].

Reproducibility study of metabolites previously associated with WG intake
Besides annotating metabolites from the discovery and replication strategies described above, we further annotated metabolites previously associated with WG intake [10,12,13,49,52,53] from the data. This list of metabolites included ARs, betaines, and other metabolites (Supplementary Table 4). In addition, due to the potential interaction between WG, endogenous metabolism, and gut microbiota [54,55], we also investigated the association between WG intake and some microbial metabolites (Supplementary Table 5) previously reported from gut microbiota or linked to the metabolism of benzoxazinoid or phenolic compounds [54,56-59].

Results
Participants' characteristics at baseline and dietary intake data were reported as median (interquartile range (IQR)) (Table 1).

Metabolites associated with wholegrain intake in the discovery cohort
After removing noise and redundant features or fragments from the same metabolites, 143 metabolites were associated with WG intake based on correlation or linear model after RF variable selection (Supplementary Table 1). Among them, 24 metabolites identified at level I or II were directly associated (Table 2). Pipecolic acid betaine, aminophenol sulfate, tetradecanedioic acid, dimethoxyphenylpropenoic acid, hydroxyisoleucine, tryptophan, and sinapyl alcohol were selected in both correlation and RF followed by a linear model. Three glucuronidated odd-chain ARs were also found in this analysis, namely, AR 19:0-glucuronide, AR 19:1-glucuronide, and AR C21:1-glucuronide (Table 2).

Microbial metabolites and other WG-related target compounds
In addition to the data-driven approach, we also aimed to replicate compounds previously associated with WG intake or produced by gut microbiota. With this approach, we did not find any additional metabolites related to WG intake (Supplementary Table 4). However, the microbial metabolites indolepropionic acid, dihydroxybenzoic acid isomer, pyrocatechol sulfate, and hippuric acid correlated with WG intake in our data (Supplementary Table 5). Features with matching m/z as indoxyl sulfate, indoleacrylic acid, and two isomers of dihydroxyphenylacetic acid (DOPAC) were also associated with WG, but no MS2 data were available to confirm the annotation even after targeted MS2 analysis (Supplementary Table 5). These metabolites, except pyrocatechol sulfate, hippuric acid and a metabolite with matching m/z as DOPAC, retained their association after adjustment for confounders (Supplementary Table 1). However, when focusing on the RC, many of the microbial metabolites could not be found in the data, and those that were annotated, e.g., indolepropionic acid and hippuric acid, were not associated with WG intake (Supplementary Table 5).

Fig. 1 Study flowchart. BMI body mass index, CHD coronary heart disease, DAG directed acyclic graph, DC discovery cohort, FDR false discovery rate, KIHD Kuopio Ischaemic Heart Disease Risk Factor Study, LC-MS liquid chromatography-mass spectrometry, LM linear regression model, LTPA leisure-time physical activity, MS2 tandem mass spectrometry, RF max random forest with maximum variable selection, RT retention time, T2D type 2 diabetes, RC replication cohort, WG whole grain. *Sample selection criteria have been reported in previous publications according to a healthy Nordic dietary pattern, the incidence of coronary artery disease for DC and egg intake, and incidence of type 2 diabetes for RC [30,31].

Table 2 List of metabolites with level of identification I and II associated with wholegrain intake both in the discovery (DC) and replication cohorts (RC)

Discussion
In this study, we observed associations between WG consumption and the levels of various metabolites in the fasting serum of middle-aged and older men from eastern Finland. Some metabolites, such as pipecolic acid betaine, tetradecanedioic acid, four glucuronidated ARs, and an unknown metabolite, retained their associations in both analyzed cohorts after adjustment for confounders (age, BMI, physical activity, smoking, alcohol, and energy intake). Pipecolic acid betaine and ARs have been previously associated with WG intake [5,6,10,12]. Pipecolic acid betaine was consistently at the top of the list with a correlation estimate of 0.398 and 0.328 after adjustment in the DC and RC, respectively (Table 2). This finding nominates pipecolic acid betaine as the serum betaine with the strongest association with WG in this study. We also found a consistent association between WG intake and four glucuronidated ARs in this study, with AR C23:1-glucuronide being associated only after adjustment for confounders. Similar to our findings, glucuronidated ARs have previously been reported to associate with WG intake in intervention studies [9,12]. The odd number of carbon atoms in their side chains highlights the preference for wheat and rye in the study population [8]. However, contrary to previous studies [60-62], we did not find free-form ARs or their metabolites, such as 3-(3,5-dihydroxyphenyl)-propanoic acid and 3,5-dihydroxycinnamic acid [63], in either the DC or the RC, which might be due to differences in the analytical methods and sample preparation techniques. WG intake was found to be associated with dihydroxybenzoic acid (Supplementary Table 5), but the position of the hydroxy groups needs to be confirmed with a reference compound.
Tetradecanedioic acid has previously been extracted from brown rice [64]. Because brown rice was not commonly consumed in Finland in the 1980s, this finding may strengthen the previously found association between WG intake and dicarboxylic acids [56], though they have not gained much attention. Sinapyl alcohol is a constituent of the lignin complex in the cereal bran [65] and has been reported to increase after a WG intervention [13]. In this study, it was associated with WG intake in both cohorts after partial correlation but in adjusted linear models only in the DC. This finding may showcase how applying several statistical approaches may enable data exploration from different angles. Consequently, to identify the most robust biomarker candidates, we focused our attention on the metabolites with observable associations in both the RF and the correlation-based approaches.
In the DC, WG intake was associated with some amino acids and peptides, namely, glutamine, hydroxyisoleucine, tryptophan, gamma-Glu-Leu, and gamma-Glu-Val. Glutamine, dihydroactiniolide, the gamma-glutamylated peptides, and PCs lost their association after adjustment for confounders, suggesting that they might not have a direct association with WG intake. Tryptophan and hydroxyisoleucine, however, retained their association after adjustment. Furthermore, microbial derivatives of tryptophan, namely, indolepropionic acid, as well as metabolites with matching m/z as indoxyl sulfate and indoleacrylic acid, retained their direct association after adjustment in the DC. Other microbial metabolites, such as dihydroxybenzoic acid, also showed a positive correlation. These associations between WG intake and amino acids and microbial metabolites were in accordance with a previous study reporting increased indoleacetic acid after rye consumption [66], which showed how WG consumption may influence an array of metabolic pathways, including protein and microbial metabolism [67].
In the RC, however, tryptophan did not associate with WG and hydroxyisoleucine lost its association after adjustment. Other microbial metabolites were correlated with WG after adjustment in the DC but could not be identified or lost their associations in the RC. This observation could be due to differences in the consumption patterns caused by different selection criteria between the DC (focus on the healthy Nordic diet [30]) and the RC (focus on egg consumption [31]), despite the same dietary assessment instrument. Similarly, we had previously shown how hippuric acid was related to WG intake in a dietary pattern with fatty fish and berries but not when it was enriched with WG alone [9]. Since the gut microbiome had a stronger association with dietary patterns than with individual dietary constituents [68], different consumption patterns could expectedly be reflected in the gut microbiome and, later, in the microbial metabolites. The variation in the levels of gut microbial metabolites hence might hinder their application as dose-dependent exposure biomarkers [69]. Likewise, the LC-MS instruments used to analyze DC and RC samples were different (LC-Orbitrap-MS vs LC-QTOF-MS, respectively). Therefore, the different analytical capabilities to detect, especially the minor compounds, cannot be ruled out. Despite the different analytical platforms, the repeated association of specific metabolites with WG intake in both the DC and the RC may highlight these metabolites as robust potential biomarkers of WG intake. Future replication in other populations, e.g., with both males and females, or of different age groups, would be necessary to further test the robustness of these metabolites. 
If these metabolites are proven to be robust across various populations, the next step would be to obtain absolute quantification of these metabolites to understand the kinetics, e.g., time- and dose-response, as well as to investigate the stability, reliability, analytical performance, and reproducibility across different laboratories [7] before they can be used as robust biomarkers of WG intake.
This study has several strengths. The reporting bias was minimized by comprehensive dietary recording accompanied by a picture book, household measurements to estimate the portion sizes, and checking by a nutritionist together with the participants. The WG consumption included the WG cereals in mixed dishes and recipes, which increases the accuracy of habitual intake assessment. Another key strength was the application of a robust metabolomics workflow with stringent quality control and compliance to widely accepted reporting guidelines. Replicated metabolite findings in the RC after adjustment for potential confounding increased the probability that covariates did not primarily drive the observed association. Also, the replicated findings based on two analytical platforms further underline their robustness. These findings may provide a basis for follow-up studies to quantify or examine a causal relationship or biological mechanisms.
There are also several limitations. First, the baseline samples in this cohort were collected during the 1980s, so the findings require validation for current diets and food products. Alterations in the serum metabolome may occur with such prolonged storage even under proper storage conditions. However, this would likely affect all groups similarly and contribute to diluting results rather than systematic bias, which may partially explain the lack of associations. The possibility that some metabolites were not found because they had completely degraded or decomposed into smaller molecules during such long storage also cannot be ruled out. Second, dietary intakes were based on a single 4-day food record, so we could not tell whether the associated metabolites were due to recent or habitual exposure. Third, the effect of processing, such as sourdough fermentation, could not be distinguished in this study, though it may affect the conversion of WG-derived metabolites [50]. The study design did not enable investigating the causality between WG intake and the metabolic profile. Potential confounding from genetic factors was minimized by selecting men from eastern Finland with a common genetic ancestry [70]. However, this may also restrict the generalizability of results to women and other populations, in which other metabolites may emerge as potential biomarkers of WG intake due to variations in the blood metabolome. The contribution of WG intake to the blood metabolome could not be separated from other favorable lifestyle factors, e.g., consumption of a healthy Nordic diet rich in root vegetables and berries, or physical activity. Although physical activity was included as one of the confounders, it might not have fully accounted for the total contribution of physical activity to the metabolic profile and its association with WG intake.
Similar arguments would also be valid for other covariates we adjusted for, such as age, BMI, smoking, and intake of energy and alcohol, as well as those we could not adjust for, e.g., healthy Nordic diet, either as a dietary pattern or as individual components, which potentially coexist with WG intake. Hence, follow-up studies in other cohorts are required to validate the findings. The application of different LC-MS platforms for discovery and replication cohorts may raise a possibility of different detection capacity between the instruments, which was minimized by focusing only on metabolites that appeared in both discovery and replication cohorts. Finally, univariate and multivariate data analysis strategies have different strengths and weaknesses, and which strategy is best suited for biomarker discovery from nontargeted metabolomics data is still unclear. Consequently, studies are de facto being performed using either or both strategies. We, therefore, chose to use both random forest followed by linear models and partial correlation under the rationale that both approaches were complementary. Thus, identifying metabolites that appeared using both techniques would provide a robust selection of biomarker candidates in this exploratory study.

Conclusions
We examined the fasting serum profile of middle-aged and older men in eastern Finland in relation to WG consumption. High consumption of WG was associated with higher levels of previously reported WG phytochemicals, such as pipecolic acid betaine and glucuronidated alkylresorcinols, as well as novel metabolites, such as tetradecanedioic acid and an unknown metabolite. The associations retained after adjustment in both the discovery and replication cohorts showed the potential of these metabolites to reflect WG intake independently of the adjusted confounders. These metabolites hence showed potential as biomarker candidates of WG intake, which, after repeated validation, may aid in the objective assessment of WG intake in future studies. Further investigations are warranted to assess the influence of individual factors, such as dietary patterns, lifestyle, and gut microbiota, on the absorption, digestion, metabolism, and excretion of these biomarker candidates and their causal links with the potential benefits of WG on metabolic health.
We thank Biocenter Finland and Biocenter Kuopio for supporting our LC-MS laboratory facility. The funders had no role in study design, data collection, data analysis, preparation of the manuscript, or decision to publish.
Materials and/or code availability Data described in this manuscript will not be made available, because they contain sensitive personal data of the subjects, which cannot be completely anonymized. These data hence fall under the General Data Protection Regulation (GDPR), which requires access to be restricted to authorized personnel with several protection measures. Requests to access and use the data are welcome and can be submitted as a written proposal to Jyrki Virtanen (jyrki.virtanen@uef.fi). The R packages notame, used for metabolomics data preprocessing (https://github.com/antonvsdata/notame), and MUVR, used for the random forest analysis (https://gitlab.com/CarlBrunius/MUVR), are freely accessible. Supplementary information is available online.

Conflict of interest All authors declared no conflict of interest.
Ethical approval The KIHD study protocol was approved by the Research Ethics Committee of the University of Kuopio (ethical approval number: #1983) and performed in compliance with the Declaration of Helsinki established in 1964 and its later amendments. Written informed consent was obtained from all participants before participation. Participants' personal information data were recoded and kept pseudonymized throughout the data handling procedure.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.