Introduction

The devastation caused by the COVID-19 pandemic is unprecedented in recent memory, with > 6.3 million deaths and 543 million cases worldwide in just over two years1. SARS-CoV-2 infection can cause a wide spectrum of illness, even in individuals who do not require acute hospital management; many individuals are asymptomatic (35.1% to 40.5% in meta-analyses2,3) while others report prolonged symptom duration (2.3% to 37.7%4). Colloquially known as “Long COVID”, the National Institute for Health and Care Excellence (NICE), defines two categories: Ongoing Symptomatic COVID (OSC28) for individuals with symptoms lasting 28–83 days (4–12 weeks), and Post-COVID syndrome (PCS84), for individuals with symptoms lasting over 84 days (12 weeks). They should display “signs and symptoms that have developed during or after an infection consistent with COVID-19 … and not explained by an alternative diagnosis”5,6.

The strongest predictors of PCS84 are age (with those aged 35–69 years having highest risk), female sex, and greater severity of acute infection7,8 Whilst vaccination against SARS-CoV-2 reduces the risk and duration of PCS849,10,11, prolonged post-infection symptomatology remains common. The United Kingdom Office for National Statistics report 2 million affected individuals in the United Kingdom by 01 May 2022, and 71% of individuals report that this affects normal daily activities12. The understanding of the pathophysiology and risk factors for these differing phenotypes-from asymptomatic infection to prolonged illness—is still evolving.

Early in the pandemic, it was noted that individuals with cardiovascular disease were at greater risk of severe illness13,14 and ‘Long COVID’15. Metabolomic profiles, particularly lipidomics, can identify risk of cardiovascular disease, with known associations of particular profiles with cardiovascular disease16 and Type 2 diabetes17,18. Such pre-pandemic lipidomics profiles have been associated with risk of hospitalisation for both COVID-19, and pneumonia caused by other pathogens, enabling the generation of an Infectious Diseases risk score (ID score) for hospitalisation due to COVID-1919. Disturbances to the same group of metabolites have also been observed in samples from hospitalised individuals when acutely unwell with COVID-1920,21. What is less clear is whether such metabolomic profiles are associated with disease duration, and/or Long-COVID, with the published studies focusing on specific metabolites in hospitalised individuals22.

However, most cases of COVID-19 are managed in the community rather than hospital. We aimed to assess whether metabolomic profiles differed in community-dwelling individuals with different symptom durations, comparing samples from asymptomatic individuals, to those with short duration, OSC28 and PCS84, approximately 6 months post-infection. This was further extended to include individuals with similar illnesses of prolonged duration, testing negative for SARS-COV-2 infection. We examined metabolites individually and assessed the previously published ID-score.

Metabolism has been related to the gut microbiome composition, with reports that the gut microbiome may separate individuals with PCS84 from healthy controls23,24. Therefore, we further explored whether stool microbiome composition, taken after acute illness and paired with metabolomics, was different between individuals with disease of different symptom duration, with and without previous confirmed SARS-CoV-2 infection. Finally, we tested whether there was any relationship between metabolomic profiles and gut metagenomic composition.

Results

Baseline characteristics of cohort

Of 15 564 individuals invited to the CSSB, 5694 (36.6%) consented and were enrolled. Of these, 4787/5694 individuals (84.1%) returned samples suitable for metabolomic analysis. Participant mean age was 52.5 years (SD 11.8), 78.7% were female, and 94.8% identified as White British, (Table 1 and Supplementary Table 2). The largest group was those with Acute COVID-19 illness (ACI) (n = 1147), lasting 7 days or less, followed by 652 with OSC28, and 307 Asymptomatic participants, our reference group. We included 287 who had a non-COVID-19 illness, of which 48/287 had symptoms over 84 days (Non-COVID-19 illness > 84 days—NC84). 161/287 were confirmed negative for SARS-CoV-2 infection by PCR testing at onset of illness, and all had a negative antibody test at enrolment (Table 1). Baseline characteristics of the groups were broadly similar to the population from which they were recruited and to each other, although the final asymptomatic group was slightly older than the average for the cohort (mean 58.1 years (SD 10.2) vs. 52.7 (SD 11.7) P < 0.001) (Table 1, Supplementary Table 2 + 3). All groups were predominantly female (75–84%, Table 1), with a median BMI of 25.2–26.2 kg/m2 (Tables 1, 2).

Table 1 Baseline characteristics of each illness category.
Table 2 Relative risk-ratios (RRR) for a 1 SD increased in ID Score, for each illness phenotype compared to asymptomatic COVID-19.

Metabolomic analysis

In total, 3718/4787 (77.7%) participants had adequate logging and metabolomic data, of whom 2561/3178 (80.6%) fell into the pre-specified phenotype groups (see Materials and Methods, Table 5).

Our primary analysis, using multinomial regression adjusted for age, sex, and BMI, compared asymptomatic participants to groups with longer symptom duration. We showed 90 of 249 (36%) metabolites differed in participants with OSC28 (28–83 days of COVID-19 symptoms) compared to Asymptomatic SARS-CoV-2-positive individuals (Fig. 1, Supplementary Fig. 1, Supplementary Table 4). 39 of these 90 (43%) also differed in participants with NC28 (Non-COVID-19 illness 28–83 days) compared to Asymptomatic SARS-CoV-2 infection, with the same direction of effect (Fig. 1, Supplementary Fig. 1, Supplementary Table 4).

Figure 1
figure 1

Relative risk ratio for each illness phenotype, per 1-SD increase in biomarker. Red indicates P ≤ 0.05 after FDR correction. Reference group (OR 1.0): Asymptomatic. ACI: Acute COVID-19 illness. OSC28: Ongoing symptomatic COVID-19 (28–83 days). PCS84: Post COVID-19 syndrome (≥ 84 days). NC28: Non-COVID-19 illness 28–83 days. NC84: Non-COVID-19 illness ≥ 84 days. Fatty Acids: Ratios& Proportions, and Absolute Values. DHA docosahexaenoic acid, SFA saturated fatty acids, MUFA monounsaturated fatty acids, PUFA polyunsaturated fatty acids. Cholesterol, LDL-C: Low density lipoprotein cholesterol, HDL-C: High density Lipoprotein cholesterol, Total-C: Total cholesterol, Total-Tg: Total Triglycerides, VLDL-C: Very low desnity lipoprotein cholesterol.

Amongst the subset of metabolites validated for clinical use (37 of 249 measurements)19, fatty acids differed in asymptomatic cases compared to both positive and negative symptomatic individuals. A higher ratio of polyunsaturated fatty acids (PUFA) compared to monounsaturated fatty acids (MUFA) was associated with a lower odds of prolonged COVID-19 (OR = 0.73, False Discovery Rate adjusted P-value (FDR-P) = 0.01 for OSC28 vs. asymptomatic), and non-COVID-19 illness (OR = 0.68, FDR-P = 0.01 for NC28 vs. asymptomatic). (Fig. 1 and Supplementary Table 5). We also noted an association of absolute MUFA levels with increased length of illness (OR = 1.28 [FDR-P = 0.04] for OSC28 vs. asymptomatic COVID-19). Similarly, in combination, raised triglycerides and VLDL lipids were associated with an increased risk of prolonged illness in both test-positive and test-negative individuals, as was the ratio of triglycerides to phosphoglycerides (Supplementary Table 4 + 5). In contrast, higher HDL lipoprotein levels and larger HDL particles were associated with Asymptomatic cases. Neither amino acids nor glycoprotein acetyls were associated with COVID symptom duration.

In both the clinically validated variables, and the entire metabolomics dataset, only 7/249 (2.8%) variables were significantly different in Long COVID (OSC28 or PCS84 combined) compared to Non-COVID illness (NC28 and NC84 combined) as the reference group (Supplementary Table 6). Of note, HDL-Cholesterol was also raised in Acute COVID-19 Illness (OR 1.24 [FDR P-value 0.005] for ACI vs Non-COVID illness) and further raised in Asymptomatic illness (OR 1.44 [FDR P-value 0.007] for Asymptomatic vs. Non-COVID illness).

The Infectious diseases score

The multi-biomarker infectious diseases score (ID Score) was calculated, whereby higher values were previously associated with hospitalization for COVID-1919. In our cohort, higher values of the ID Score were generally associated with longer duration of symptoms of all illnesses, but notably not Post-COVID Syndrome (Table 2).

Sensitivity analyses adjusting for additional variables

The direction of effect and significance of results were unchanged for most sensitivity analyses, including adjustment for baseline cardiovascular disease and diabetes; however, the effect of the ID Score was marginally stronger after adjusting for Healthy Plant-based diet index (OR = 1.61 with hPDI, OR = 1.53 without hPDI: for OSC28 compared to asymptomatic COVID-19) (Supplementary Fig. 2).

Microbiome demographics

A subset (n = 301) of the metabolomic cohort had corresponding microbiome data (Table 3). The median time between symptom onset and microbiome assessment was 223 (IQR 50 days), with the minimum time between symptom onset and microbiome assessment of 33 days (implications considered further in Discussion).

Table 3 Baseline Characteristics for subset used for microbiome analysis, per COVID group.

Quality control of gut microbiome sequence data for these 301 individuals revealed good breadth of coverage of MetaPhlAn markers. In addition, depth of coverage of MetaPhlAn markers was high for most abundant species (~ 3X), with large areas of < 0.5X coverage (Supplementary Fig. 3).

For this subset analysis, individuals were grouped into categories as follows: (1) Asymptomatic: n = 35; (2) ACI: n = 109; (3a) OSC28 n = 52; (3b) Post COVID-19 syndrome (≥ 84 days): n = 20; (4) Negative symptomatic (≥ 28 days, including ≥ 84 days): n = 85 .

Alpha- diversity analysis

Microbial richness did not differ between groups (Wilcoxon signed-rank test P-value > 0.25 for all comparisons, Supplementary Fig. 4). There were no differences in Alpha-diversity (Simpson or Shannon) between groups, whether they were test positive or negative, symptomatic or asymptomatic, or with long or short symptom duration (Supplementary Fig. 4).

Beta-diversity analysis

Similarly, beta-diversity analysis showed no large-scale shifts in microbial profiles between groups (Supplementary Fig. 4).Ongoing symptomatic COVID-19 and Post COVID-19 syndrome groups were amalgamated and beta-diversity analysis repeated; there remained no significant difference (data not shown).

Microbial differential abundance analysis

A Generalised Linear Model controlled for confounding factors, including age, sex, and BMI identified three species with differences between groups—specifically, Firmicutes bacterium CAG 94, Ruminococcus callidus and Streptococcus vestibularis. Firmicutes bacterium CAG 94 differed between Asymptomatic SARS-CoV-2 infection (ref) and Acute COVID-19 (FDR P-value = 0.03), OSC28 (FDR P-value = 0.01) and PCS84 (FDR P-value = 0.04). Ruminococcus callidus, also a Firmicute, differed between Asymptomatic and Non-COVID-19 illness (≥ 28 days) (FDR P-value = 0.01) (Table 4). Streptococcus vestibularis differed only when comparing OSC28 and non-COVID-19 participants (FDR P-value 0.073).

Table 4 Potential biomarkers most divergent between COVID groups.

Correlation between metabolomic and microbiome analysis

Correlations between the metabolites and microbial taxa were assessed using Spearman’s rank coefficient. We found no evidence that microbiome taxa were associated with differences in those metabolites associated with symptom duration in this dataset (Fig. 2). Negatively correlated metabolites and microbial species clustered in the top right-hand corner, with mild positively correlated associations throughout the remainder of the heatmap. A distinctive pattern was unable to be elucidated. The primary species driving the negative correlation were Alistipes finegoldii, an anaerobic, mesophilic, rod-shaped bacterium and Bacteroides cellulosilyticus, a cellulolytic bacterium.

Figure 2
figure 2

Spearman correlation of microbiome profiles and metabolomic profiles. Single microbial taxa correlated with clinically validated metabolomic data using Spearman’s rank sum non-parametric test. FDR P-values are displayed *P < 0.01. Rows and columns are hierarchically clustered (Euclidean distance). The R package ‘corrplot_0.90' (https://github.com/taiyun/corrplot) was used to compute the variance and the covariance or correlation. The packages ‘pheatmap_1.0.12’ (https://cran.r-project.org/web/packages/pheatmap/index.html) and the ‘cor.mtest' function of ‘corrplot_0.90’ were used to visualise the heatmap and calculate associated P values.

Discussion

Summary of results + results in context

We observed a metabolic profile, particularly in lipid components, that differentiated individuals with longer symptom duration compared with individuals with asymptomatic infection. This profile was evident for individuals with long symptom duration regardless of SARS-CoV-2 test status, compared with asymptomatic infected individuals.

The specific differences identified in association with long-duration illness were small in magnitude and related to atherogenic dyslipidaemia (Fig. 3); in particular, blood fatty acid concentrations, including higher absolute and relative concentrations of MUFA, lower relative concentrations of PUFA, and a lower PUFA/MUFA ratios. In humans, circulating PUFA is derived from dietary sources, and blood levels correlate both to dietary intake and to levels in adipose tissue stores. MUFA, on the other hand, is synthesised in significant quantities in vivo and circulating concentrations (but not dietary intakes) are associated with increased risk of cardiometabolic disease25. Elevated serum MUFA and low serum PUFA have been identified in many studies as associated with ill health, including cardiovascular risk26, metabolic syndrome27, and mortality from infections28. Although our study assayed blood levels up to 9 months after initial COVID illness, these measures, particularly MUFA, are relatively stable over time19,29 and may reflect long-term blood concentrations. Moreover, our results concord with previous work in the UK Biobank, using the same platform, which demonstrated a similar direction of association with the same metabolites with increased risk of acute severe COVID-19 and with pneumonia, using blood samples collected many years beforehand19. Recent studies have reminded clinicians of the increased risk of vascular diagnoses after both COVID-19 and similar respiratory infections30,31,32 and while the risk reduces dramatically after the acute period, there is excess risk which remains many months afterwards. It is possible that our findings may reflect un-detected prior cardiovascular risk in symptomatic COVID-19 cases or vascular changes as a consequence of disease.

Figure 3
figure 3

Atherogenic-dyslipidaemic biomarkers. Relative risk ratio for each illness phenotype, per 1-SD increase in biomarker. Adjusted for age, sex and body mass index. 95% Confidence intervals displayed with P-values adjusted using Benajmini-Hochberg False Discovery Rate correction. Red indicates FDR corrected P-value ≤ 0.05. ACI: Acute COVID-19 illness, OSC28: Ongoing symptomatic COVID-19 (28–83 days), PCS84: Post COVID-19 syndrome (≥ 84 days), NC28: Non-COVID-19 illness 28–83 days, NC84: Non-COVID-19 illness ≥ 84 days, ApoB_by_ApoA1: Ratio of apolipoprotein B to apolipoprotein A1, HDL_C: High density lipoprotein cholesterol, HDL_L: Total Lipids in high density lipoprotein, LDL_TG: Triglycerides in low density Lipoprotein, L_HDL_P: Concentration of large high density lipoprotein particles, MUFA: Monounsaturated Fatty Acids, PUFA_by_MUFA: Ratio of polyunsaturated fatty acids to monounsaturated fatty acids, Remnant_C: Remnant cholesterol (non-HDL, non-LDL -cholesterol), S_HDL_TG: Cholesterol in small HDL, VLDL_C: Very low density lipoprotein cholesterol, VLDL_L: Total lipids in VLDL, VLDL_TG: Triglycerides in VLDL, VLDL_size: Average diameter for VLDL particles.

Higher levels of VLDL-particles/lipids and TG-enriched lipoproteins were associated with longer illness duration, although again effects were relatively small in absolute terms. Both components have also been associated with ill-health in other (pre-pandemic) studies—in particular, cardiovascular disease, diabetes, renal disease, obesity and depression33,34, although studies did not always adjust for diet35. In our study, adjusting for dietary intakes or co-morbidities did not change the findings. We also demonstrate similar findings to previous work associating peripheral vascular disease with metabolic profiles representing atherogenic dyslipidaemia (Fig. 3)16,36.

We did not find metabolomic profiles specific to Long COVID comparing individuals with long-duration illness who were positive vs. negative for SARS CoV-2 infection. This would suggest that atherogenic-dyslipidaemic metabolic profiles are associated with long symptom duration, regardless of the cause of illness. This is supported by a growing body of literature identifying “conventional” risk factors such as high BMI, diabetes and cardiovascular disease, as risk factors for long COVID, as well as age7,37 all of which are associated with adverse metabolic changes26,34,38. There was no meaningful change in our results with inclusion of self-reported cardiovascular disease and diabetes as co-variates (Supplementary Fig. 2).

There were some unexpected negative findings. Others have associated GlycA with increased mortality39, and it is often increased in other conditions such as diabetes34. It was the biomarker showing the greatest association with hospitalisation for COVID-19 using pre-pandemic UK Biobank samples19. However, we found no association of GlycA with illness duration in our community-based sample. GlycA is considered a marker of inflammation, and our participants were sampled many months after the onset of symptoms and SARS-CoV-2, and were no longer reporting symptoms. It is possible that prolonged symptoms after COVID-19 may not always be related to ongoing inflammation, but previous damage that has not been repaired. There was also no association between any of the amino acid metabolites and length of illness (Supplementary Fig. 1), although some amino acids have previously been associated with hospitalisation for COVID-1919 and were included in the ID Score.

Looking at the ID score, previously associated with severity of acute COVID-1919, we also demonstrated an increased score was associated with increased length of illness. This was replicated for both COVID-19 and those reporting ongoing symptoms without COVID-19. This suggests that there are shared associations which link severe acute illness and prolonged illness. This tallies with our previous observation that participants with a high acute symptom burden early in their illness were at greater risk of Long COVID7. Others have shown perturbations in lipid profiles are also associated with severity of disease, and correlate with changes in immune cell populations and function. In particular, depletion of lipids is often associated with moderate and severe acute disease, compared to mild disease, and associates with shifts in immune cell profiles, including the balance of CD8 and effector T-Cells, and monocyte subsets40. Particular disturbances in natural killer cell (NK-Cell) function and phenotype have also been noted to correlate with severity of acute illness. These further correspond to changes in TNF and INF-α signatures, and FcRγ expression41,42. There are suggestions this may relate to different metabolic profiles of immune cells, metabolic reprogramming, and hepatic injury43,44. Further work would be needed to see if similar correlations with immune function hold true for illness duration, both during the initial phase of the illness, as well as during ongoing, prolonged illness, compared to those who have recovered.

Although 97 metabolites and the ID score were associated with Ongoing Symptomatic COVID-19 illness, only 16 of these were still associated with Post-COVID-19 syndrome. The PCS84 group is much smaller, and may be heterogeneous. Further work is needed to see whether certain subgroups within PCS84 display similar associations to OSC28, whilst other subgroups do not. However, it is of note that those most at risk for PCS84, women in mid-life4,45, are less at risk for some conditions associated with metabolic derangement, such as cardiovascular disease46,47. Our results might suggest that truly persistent symptoms in PCS84 may represent a different type of disease to OSC28, NC28 and NC84, with different risk factors, pathophysiology, and may therefore need different interventions to prevent or treat.

There was little change in the effect of the ID score on length of illness in our sensitivity analyses including pre-morbid cardiovascular disease and diabetes, frailty, lifestyle, IMD, the use of any supplements and the Diet Quality Score, suggesting that our metabolite associations with length of illness are independent of their associations with these variables. Models controlling for whether individuals took omega-3 supplements showed a diminished relationship between ID score and length of illness in non-COVID disease but were unchanged in COVID-19 illness. This finding which might suggest that omega-3 supplementation is a marker explaining this relationship in non-COVID illness only. The addition of the healthy plant-based diet index to the model increased the effect size of the ID score for each length of illness. Further work on hPDI and other correlated variables, such as socio-economic status, might help explore this finding.

We found no large-scale shift in gut microbiome, assessed up to 9 months post symptom onset, in relation to illness duration after SARS CoV-2 infection, and notably, no difference between COVID-19 positive and negative individuals with long-duration symptoms. Analysis of both taxonomic and functional microbial profiles showed increases in relative abundance of some Firmicutes in asymptomatic individuals, for example Firmicutes bacterium CAG:94, an uncharacterized taxon requiring further study to determine its functions. This finding should be interpreted with caution, and we did not find evidence that it related to metabolite alterations. Here, timing of our sampling is highly relevant to the interpretation of our results. Alterations in an individual’s microbiome may occur in the context of acute disease48, whether from infection, dietary changes, medication, usage, and/or immune function; however, an individual’s microbiome regresses over time49 to reflect a stable baseline microbiome50,51. Thus, faecal microbiome samples collected 9 months after the start of illness may represent baseline individual microbiome, which, in our study, did not relate to symptom duration. In any case, our findings would suggest no long-term alterations in microbial profile in individuals who experienced ‘long COVID’. Our results contrast to one small previous study of 106 hospitalised individuals, of whom 76% had ongoing symptoms 6 months after acute SARS-CoV-2 infection. In this study, altered gut microbiome composition was reported in individuals with persistent symptoms in patients with COVID-19 compared to healthy historical pre-pandemic controls52. However, cases were hospitalised for an average of 17 days, received amoxicillin-clavulanic acid among other interventions (including Ribavirin, Interferon and Remdesivir), and changes seen could have been a consequence these treatments, illness severity and/or altered diet in these individuals. Two other studies noted changes in immune-modulating commensal bacteria53,54 potentially specific to COVID-19, but again in hospitalised cohorts, where additional treatments, and illness severity, differ from our community-dwelling participants.

Strengths + Limitations

Our study benefits from large size, with a long duration of prospective symptom reporting, and availability of both metabolite and microbiome data on the same community-based participants. These platforms are well validated, including for clinical use of the metabolomics data. Our participants also reflect the spectrum of COVID cases in the community. As participants were recruited prior to vaccination in the wild-type era, this reduces complexity by variance attributable to virus variant, and type and timing of vaccination55. Infection status was reconfirmed at enrolment, ensuring accuracy of classification by gold standard methodology. During this period, there were also relatively few acute COVID-19 specific treatments available, and none routinely used in UK community-managed individuals allowing our study to reflect the natural history of COVID-19.

We have also been able to conduct sensitivity analyses including variables such as frailty, lifestyle, deprivation, and diet, often not extensively accounted for in metabolomics studies, in addition to more traditional medical co-morbidities. We also have analysed participants who report ongoing symptoms not attributable to SARS-CoV-2, and therefore able to test the specificity of our findings.

Limitations include the cross-sectional nature of the metabolites and microbiome assayed up to 9 months after illness onset. Time of sampling, for both the metabolomics and microbiome analysis is both a strength and a limitation. Post convalescence sampling means that acute perturbations are likely to have resolved, but without pre-pandemic sampling we cannot assess whether changes found were consequential or pre-existing. Longitudinal data from cohorts sampled before and after pandemic are needed to address this issue. We have also considered direct validation of our findings in another cohort, but due to the sui generis method of classification using the ZCS data, other cohorts’ participants would lack the same data to be classified into directly comparable groups.

Conclusion

Metabolic profiles of community cases with asymptomatic COVID-19 were notably different to those with longer illnesses, displaying an atherogenic lipoprotein phenotype, and this difference was apparent regardless of whether the illness was due to COVID-19 or another acute phenomenon. Those with COVID-19 symptoms for ≥ 28 days could not be clearly distinguished from those with non-COVID-19 illnesses of prolonged duration. A biomarker score previously predictive of severe COVID-19 was overall predictive of prolonged illness, although not all individual components were. In contrast, gut microbiome diversity did not differ by length of illness, suggesting no significant gut microbiome dysbiosis post COVID-19 infection.

Further research with longitudinal sampling pre- and post-illness is warranted, to determine if the observed metabolomic associations with longer illness are pre-existing risk factors, or consequential.

Materials and methods

Cohort description

Study participants were volunteers from the COVID Symptom Study Biobank (CSSB, approved by Yorkshire & Humber NHS Research Ethics Committee Ref: 20/YH/0298). Individuals were recruited to the CSSB via the ZOE COVID Study (ZCS)56 using a smartphone-based app developed by Zoe Ltd, King’s College London, the Massachusetts General Hospital, Lund University, and Uppsala University, launched in the UK on 24 March 2020 (approved by the Kings College London Ethics Committee LRS-19/20–18,210). Via the app, participants self-report demographic information, symptoms potentially indicative of COVID-19 disease (both closed/polar questions, and free text), any SARS-CoV-2 testing, and SARS-CoV-2 and influenza vaccinations. Participants can be invited via email to participate in other studies, according to eligibility.

In October 2020, prior to UK vaccination roll-out, 15,564 adult participants from the ZCS were invited to join the CSSB. Invited participants had: (a) a self-reported SARS-CoV-2 test result: A swab test (in this time period RT-PCR) at the start of illness, or a subsequent antibody test, whether positive or negative; and (b) logged at least once every 14 days since start of illness, or since the start of logging if asymptomatic.

Initially, individuals were recruited in four groups based on understanding of Long COVID at that time, and prior to definitions being published: (1) Asymptomatic, with confirmed infection; (2) Short illness (≤ 14 days) after confirmed infection; (3) Long illness (≥ 28 days)5 after confirmed infection; and (4) Long symptom duration (≥ 28 days) but with a negative test for SARS-CoV-2 infection. Invited individuals were matched across these four groups by Euclidian distance for age, sex and BMI9. Participants were invited by email, and consented separately into the CSSB. Participants were sent home sampling kits in November 2020 via post, and returned capillary blood samples for metabolomic analysis. This also enabled antibody testing using an ELISA method57 to confirm prior infection status of all participants, the current standard for retrospective ascertainment58. A subgroup also consented to send in stool samples for analysis of their gut microbiome.

Study participants were subsequently aligned using symptoms ascertained up to sample collection date, and the permissible gap in logging was further tightened to 7 days to increase accuracy of classification. Long-COVID groups were reshaped to match definitions published in November 20205 (see Table 5) corresponding to Ongoing Symptomatic COVID-19 (28–83 days, OSC28) and Post-COVID-19 Syndrome (> 84 days, PCS84). The same duration parameters were applied to those reporting symptoms with the same timeframe parameters around a negative test for SARS-CoV-2, who were presumed to have a non-COVID-19 illness. This yielded six groups for comparison—four SARS-CoV-2 positive groups: Asymptomatic, Acute COVID-19 (≤ 7 days), OSC28, PCS84; and two SARS-CoV-2 negative groups: Non-COVID-19 illness 28–83 days (NC28), Non-COVID-19 illness ≥ 84 days (NC84).

Table 5 Table describing definitions used to group individuals based on their swab (RT-PCR) or antibody test status and duration of symptoms, as logged in ZOE COVID Symptom Study app.

To check that groupings assigned by the recruitment algorithm were clinically accurate, symptom logging maps were scrutinised in a subsample (n = 115) by two researchers (MFÖ, CJS), independently and blind to algorithmic phenotype classification, before analysis. Final categories are detailed in Table 5.

Due to changes in logging stringency criteria, some participants also fell into additional, shorter categories of illness duration, detailed in Supplementary Table 1. These additional phenotypes have been reported in supplementary data tables, but not included in primary analysis as they were not recruited for this purpose, and their classification is less certain.

Metabolomics

Capillary blood samples were obtained between November 2020 and January 2021, when participants had recovered. Samples were returned in plasma collection tubes with initial processing of 20µL used for serology with the remainder frozen. Samples were processed in March/April 2021 by Nightingale Health Oyj (Helsinki, Finland) using high-throughput nuclear magnetic resonance metabolomics, measuring 249 metabolites including lipids, lipoprotein subclasses with lipid concentrations within fourteen subclasses, lipoprotein size, fatty acid composition, and various low-molecular weight metabolites including amino acids, ketone bodies and glycolysis metabolites33. Of these, 37 are certified for clinical diagnostic use and formed the focus of our analysis (referred to herein as “Clinically Validated”)59. Quality control was performed and reported by Nightingale Health. Due to postal transit time, glucose, lactate, and pyruvate could not be assessed and have been excluded from analyses, and creatinine was unavailable. There were no concerns raised with other biomarkers. Metabolites measured using this panel have been associated with the risk of hospitalisation for COVID-19 in the UK Biobank previously19, including 25 of the clinically validated biomarkers used in an Infectious Diseases risk prediction score (ID Score) derived using Lasso regression19.

Gut microbiome

Sample collection and faecal sample processing

Two faecal samples per individual were collected and returned by post: faecal material from both collection tubes were homogenised in a Stomacher® bag, aliquoted out and stored at -80 degrees Celsius. The first 301 samples that would maintain a balance for BMI, age, and sex, were selected to undertake a pilot investigation of microbiome differences.

DNA extraction and sequencing

Genomic DNA (gDNA) was isolated from 1 g faecal sample, using a modified protocol of the MagMax Core Nucleic Acid Purification Kit and MagMax Core Mechanical Lysis Module60. Libraries were prepared using the Illumina DNA Prep (Illumina Inc., San Diego, CA, USA) following the manufacturer’s protocol. Libraries were sequenced (2 × 150 bp reads) using the S4 flow cell on the Illumina NovaSeq 6000 system.

Metagenome quality control and pre-processing

All metagenomes were quality controlled using the pre-processing pipeline (available at https://github.com/SegataLab/preprocessing). Briefly, pre-processing consisted of three main steps: (i) read-level quality control, (ii) removal of host sequence contaminants, and (iii) splitting and sorting of cleaned reads. Read-level quality control removes low-quality reads (quality score < Q20), fragmented short reads (< 75 bp), and reads with ambiguous nucleotides (> 2 Ns), using trim-galore (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/). Host sequences contaminant DNA were identified using Bowtie 261 with the “–sensitive-local” parameter to remove both the phiX 174 Illumina spike-in and human-associated reads. Splitting and sorting allowed for creation of standard forward, reverse, and unpaired reads output files for each metagenome.

Taxonomic and functional profiling

The metagenomic analysis was performed using the bioBakery 3 suite of tools62. Taxonomic profiling and estimation of species’ relative abundances were performed with MetaPhlAn 3 (v. 3.0.7 with “–stat_q 0.1” parameter)62,63. MetaPhlAn 3 taxonomic profiles were used to compute three alpha diversity measures: (i) the number of species with positive relative abundance in the microbiome (‘Richness’), (ii) the Shannon diversity index, independent of richness, which measures how evenly microbes are distributed64, and (iii) the Simpson diversity index, which accounts for the proportion of species in a sample65. Similarly, species-level relative abundances were used to estimate microbiome dissimilarity between participants (beta diversity) using the Bray–Curtis dissimilarity metric, which accounts for the shared fraction of the microbiome between two individuals and their relative abundance values66. Functional potential profiling of metagenomes was performed with HUMAnN 3 (v. 3.0.0.alpha.3 and UniRef database release 2019_01)62,67 that produced pathway profiles and gene family abundances. We assessed beta diversity by computing a Principal Coordinates Analysis (PCoA)/Metric Multidimensional Scaling (MDS) based on the pairwise Bray–Curtis dissimilarity metric.

Additional covariates

BMI was derived from self-reported weight and height. Other self-reported information (obtained from ZCS app-reported data) included smoking; and co-morbid illness (‘heart disease’, ‘diabetes’ (and type of diabetes), ‘lung disease’ (including asthma), hay fever, eczema, ‘kidney disease’ and current cancer (type, and cancer treatment). Address data was linked to the UK Index of Multiple Deprivation (IMD), with the IMD rank decile used as a categorical variable measuring local area deprivation68,69,70,71. Frailty was assessed using the Prisma-7 scale, with a score > 2 indicating frailty72.

A subset of participants had participated in a dietary assessment during the COVID-19 pandemic, also recruited through the ZCS (published previously73,74). This included detailed information on vitamin supplementation (including omega-3 oils), physical activity, alcohol consumption, dietary habits and a food frequency questionnaire. These data were used to derive a diet quality score73, and a plant-based diet index73, analysed as continuous variables. Both have previously been associated with cardiovascular disease75, Type 2 Diabetes76, a lower risk of COVID-19 illness, and a lower risk of hospitalisation for COVID-19 during the early waves of the pandemic73.

Statistical analysis

The statistical analyses were performed using R software (v. 4.0.5) and Stata (v.17, StataCorp). Baseline characteristics were described by frequency and percentages. Descriptive data on those invited and those enrolled, are presented in Supplementary Table 2 + 3. Metabolites were all log-transformed and standardised (mean 0, standard deviation 1) as per protocol19. To account for 0 values, prior to log transformation, a pseudo-count of 1 was added to all values.

Initial analysis examined association between duration of illness and each metabolite individually, using multinomial logistic regression (all adjusted for age, sex, and BMI). The Asymptomatic group was used as the reference category of the outcome variable. We also performed a secondary analysis, using the non-COVID-19 participants as a reference category. The primary analysis was then extended, assessing association between length of illness and ID score, with asymptomatic as the reference category. The Benjamini–Hochberg False discovery rate method was used to correct P-values for multiple testing77.

To assess potential confounders, we performed eight sensitivity analyses additionally adjusting for: (1) self-reported co-morbidities (cardiovascular disease and diabetes), (2) Frailty, (3) IMD rank decile, (4) lifestyle variables (smoking status, frequency of alcohol consumption, frequency of physical activity), (5) self-reported use of any health supplement, (6) self-reported use of Omega-3 containing supplements, (7) Diet quality score, and (8) Healthy plant-based diet index (hPDI).

For microbiome analyses, differences between the alpha diversity distributions of the groups were assessed using the Wilcoxon rank-sum test within the ‘RClimMAWGEN’ package (P-value ≤ 0.05 considered significant). With a sample size of 300 individuals, we have 79% power at 0.05 significance level, assuming a low effect size of 0.20. PERMANOVA from the ‘adonis2’ function of the ‘vegan’ package, was used to test for differences between groups based on the beta diversity computed from the PCoA/MDS of the pairwise Bray–Curtis dissimilarities. For the microbial differential abundance analysis, we built a generalized linear model, controlling for confounding factors, including age, sex,and BMI. Only species with minimum 20% prevalence were used in this statistical analysis78. P-values were corrected using the Benjamini–Hochberg method77.

Spearman correlation analyses were conducted to associate microbiome profiles of 301 individuals with their metabolome profiles, adjusting for confounding factors (age, sex, BMI). Correlation analyses were conducted using R version 3.6.0. The package ‘corrplot_0.90' was used to compute the variance and the covariance or correlation, ‘pheatmap_1.0.12’ and ‘cor.mtest' were used to visualise the heatmap, calculate associated P values. Hierarchical clustering of both top 39 most abundant microbial species and 39 metabolic profiles was conducted using hclust, implementing the Ward.D2 agglomeration method. The package ‘p.adjust’ was used to perform Benjamini–Hochberg multiple testing correction77.

Ethical approval

The ZOE COVID Study by the King's College London Ethics Committee (Ref: LRS-19/20-18,210) and licensed under the Human Tissue Authority (reference 12,522). All ZCS participants provided informed consent for use of their data for COVID-19 research. The COVID Symptom Study Biobank and related studies, including this study, were approved by the Yorkshire & Humber NHS Research Ethics Committee (Ref: 20/YH/0298). CSSB participants were invited to join from the ZCS user base and provided informed consent to participate in the additional questionnaire and sample collection studies, and for linkage to app-collected data. All research and sample processing has been carried out in line with relevant guidelines including the Declaration of Helsinki.