Metabolic characterisation of disturbances in the APOC3/triglyceride-rich lipoprotein pathway through sample-based recall by genotype

Introduction High plasma triacylglyceride levels are known to be associated with increased risk of atherosclerotic cardiovascular disease. Apolipoprotein C-III (apoC-III) is a key regulator of plasma triacylglyceride levels and is associated with hypertriglyceridemia via a number of pathways. There is consistent evidence for an association of cardiovascular events with blood apoC-III level, with support from human genetic studies of APOC3 variants. As such, apoC-III has been recognised as a potential therapeutic target for patients with severe hypertriglyceridaemia with one of the most promising apoC-III-targeting drugs, volanesorsen, having recently progressed through Phase III trials. Objectives To exploit a rare loss of function variant in APOC3 (rs138326449) to characterise the potential long-term treatment effects of apoC-III targeting interventions on the metabolome. Methods In a recall-by-genotype study, 115 plasma samples were analysed by UHPLC-MS to acquire non-targeted metabolomics data. The study included samples from 57 adolescents and 33 adults. Overall, 12 985 metabolic features were tested for an association with APOC3 genotype. Results 144 uniquely annotated metabolites were found to be associated with rs138326449(APOC3). The highest proportion of associated metabolites belonged to the acyl-acyl glycerophospholipid and triacylglyceride metabolite classes. In addition to the anticipated (on-target) reduction of metabolites in the triacylglyceride and related classes, carriers of the rare variant exhibited previously unreported increases in levels of a number of metabolites from the acyl-alkyl glycerophospholipid and ceramide classes. Conclusion Overall, our results suggest that therapies targeting apoC-III may potentially achieve a broad shift in lipid profile that favours better metabolic health. Electronic supplementary material The online version of this article (10.1007/s11306-020-01689-9) contains supplementary material, which is available to authorized users.

set as follows: Sheath gas = 50 arbitrary units, Aux gas = 13 arbitrary units, Sweep gas = 3 arbitrary units, Spray Voltage = 3.5kV, Capillary temp. = 263 °C, Aux gas heater temp. = 425 °C. Data dependent MS2 in 'Discovery mode' was used for the MS/MS spectra acquisition using following settings: resolution = 17,500 (FWHM at m/z 200); Isolation width = 3.0 m/z; stepped normalised collision energies (stepped NCE) = 20, 50, 80%. Spectra were acquired in three different mass ranges: 200 -400 m/z; 400 -700 m/z; 700 -1500 m/z. A Thermo ExactiveTune 2.8 SP1 build 2806 was used as an instrument control software in both cases and data were acquired in profile mode. Quality control (QC) samples were analysed as the first ten injections and then every seventh injection with two QC samples at the end of the analytical batch. Two blank samples were analysed, the first as the 6th injection and then at the end of each batch.

Raw data processing
Raw data acquired in each analytical batch were converted from the instrument-specific format to the mzML file format applying the open access ProteoWizard software (Chambers et al. 2012). Deconvolution was performed with XCMS software according to the following settings of min. peak width (4 for HILIC and 6 for LIPIDS); max. peak width (30); ppm (12 for HILIC and 14 for LIPIDS); mzdiff (0.001); bw (0.25); mzwid (0.01) (Smith et al. 2006). A data matrix of metabolite features (m/z-retention time pairs) vs. samples was constructed with peak areas provided where the metabolite feature was detected for each sample. Data for the first 8 QC samples were removed from the dataset. Putative annotation of metabolites or metabolite groups was performed by applying the PUTMEDID-LCMS workflows operating in the Taverna workflow environment (Brown et al. 2011). We applied 5 ppm mass error and a retention time range of 2 s in feature grouping and molecular formula and metabolite matching.
Because different metabolites can be detected with the same accurate m/z (for example, isomers with the same molecular formula), multiple annotations could be observed for a single detected metabolite feature. Also, a single metabolite could be detected as multiple molecules, particularly as a different type of ion (e.g., protonated and sodiated ions). Throughout this article, the term "metabolite" refers to either single metabolites or groups of molecules with the same retention time and the same accurate m/z. All molecules were annotated according to guidelines for reporting of chemical analysis results, specifically to Metabolomics Standards Initiative level 2 (Sumner et al. 2007).

Assessment of data quality
The data for pooled QC samples were applied to perform QC filtering. Data matrices were corrected for run-order drift in intensity using the Quality Control-Robust Spline Correction (QC-RSC) algorithm (Kirwan et al. 2013) (https://github.com/computationalmetabolomics/sbcms). For each metabolite feature detected QC samples 1-8 were removed and the relative standard deviation (RSD) and percentage detection rate were calculated. Metabolite features with a relative standard deviation (RSD) > 30% and a percentage detection rate < 70% were removed from the dataset. Details of the numbers of features before ("Raw data") and after this process ("Post-QC data") can be found in Table S2.

Post-raw data processing workflow
A series of procedures were applied to the data derived from the processing steps described above in preparation for the statistical analysis of the data, as described in Fig. 1. All procedures were carried out in R v3.5 or later (R Core Team 2019). Each of the four datasets (HILIC-POS, HILIC-NEG, LIPIDS-POS and LIPIDS-NEG) was processed separately as follows. To increase the quality of the data, two filters were applied to the study samples (i.e. not including QC samples) and one to the features. Firstly, samples were excluded from all subsequent analyses if they had more than 50% missing data across the set of features. This resulted in one sample exclusion from the HILIC-POS dataset. Secondly, total peak area was calculated for each sample and samples whose calculated peak area fell outside three standard deviations from the mean were excluded. This resulted in two sample exclusions from the HILIC-NEG dataset, three from the HILIC-POS dataset, one from the LIPIDS-NEG dataset and three from the LIPIDS-POS dataset. After sample exclusion, within-class (carrier vs. control) feature missingness was calculated. Features were excluded from all subsequent analyses if they had >30% missing data in both classes. Details of the numbers of features and after this process ("Post-filtering data") can be found in Table S2.
As is typical in metabolite data of this nature, levels of feature missingness were relatively high (up to 52.6% missing) with the majority of missing data points likely due to very low levels of the metabolite, i.e. below the level of detection. Therefore, an iterative imputation method based on a random forest was used to impute missing values in the dataset. The R package missForest (Stekhoven and Buhlmann 2012) was used with its default settings. Both the unimputed and the imputed versions of the data were taken forward into the analysis with results from the former representing the primary analysis and those from the latter a sensitivity analysis included to assess the likely impact of missing data on our primary analysis.
In preparing the data for parametric analyses, the next step was normalisation. Data were normalised across samples using probabilistic quotient normalisation (Frank Dieterle et al. 2006) (PQN) implemented using the normalisation function in the KODAMA R package (Cacciatore et al. 2016;Eisen et al. 1998;Frank Dieterle et al. 2006). The median value across all available QC samples in the dataset was used as the reference sample for PQN. Following normalisation, the QC samples were removed from the dataset. Exploration of the data revealed the distributions of 53-56% (range across the four datasets) of features to be nonnormal based on a Shapiro-Wilk test of normality (w>0.95) (J. P. Royston 1982aRoyston , 1982bP. Royston 1995). Therefore, a rank-transformation to normality (Aulchenko et al. 2007) was applied across all features in preparation for statistical analysis.
There is evidence that levels of at least some metabolites can be influenced by fasting time (Sedlmeier et al. 2018) and for this reason, we conducted our primary analysis only on fasted samples. However, this resulted in the exclusion of nine samples from the analysis, all from rare variant carriers. Therefore, we conducted a sensitivity analysis in which the primary analysis of the combined dataset was rerun as described above but with no sample exclusions being made based on the fasting/non-fasting status of the samples (n=51 carriers/64 noncarriers).
The imputed version of the data was used to assess the technical variability (measured by the replicate analysis of a pooled QC sample) and biological variability as part of the quality control process. Data were normalised and rank-transformed as described above but with QC samples retained. A principal components analysis (PCA) was then performed using the

Models fitted
In the young participants only and mothers only analyses we used either a simple linear multivariate model or a mixed linear multivariate regression model fit by maximum likelihood to test the association between each feature (the dependent variable) and rs138326449(APOC3). Only samples with non-missing data for the feature being tested were included in the analysis of that same feature (hence the actual sample size varies by feature).
If, after excluding samples with missing data, less than five repeated measures remained in the dataset, a simple linear model was used (fitted using the R function lm()), otherwise a mixed linear regression model was fitted with individual included as a random effect in order to take account of repeated measures (lmer()). In order to combine all data (young participants and mothers) into a single model, we extended the mixed linear regression model to incorporate the effects of pedigree using the R package 'pedigreemm' (Vazquez et al. 2010). In this way we were able to include both young participants' and mothers' data in the same model whilst accounting for mother-child relationships and repeated measures within the sample set. All models were fitted with age and sex (except in mothers only analyses) fitted as fixed effects.
The association between each feature and rs138326449(APOC3) was tested by performing a likelihood ratio test of nested models, comparing models with and without class status (carrier vs. control) fitted as a fixed effect. Resulting p-values were then adjusted to take account of multiple testing using the false discovery rate-based method of Benjamini and Hochberg (BH) (Benjamini and Hochberg 1995).

Feature identification
PUTMEDID_LCMS (Brown et al. 2011) was applied to annotate metabolites using accurate mass-to-charge (m/z) data using KEGG, the Human Metabolome Database and LipidMaps as reference databases containing all known metabolites. LipidSearch (Peake et al. 2015) was applied to annotate metabolites using accurate m/z MS/MS data with matching of experimental MS/MS data to theoretical MS/MS data. To obtain greater confidence in the metabolite annotation, experimental MS/MS fragmentation data was also compared to the reference mass spectral libraries mzCloud (www.mzcloud.org) and the in-silico mass spectral library using the LipidSearch software (Peake et al. 2015). The expected retention time ranges, derived from authentic chemical standards, were applied to remove false annotations related to lipid metabolites.

Data visualisation
Venn diagrams were generated using the R package VennDiagram (Chen and Boutros 2011) to show the overlap between associated features identified firstly, in the three primary analyses and, secondly, in the primary analysis compared to the sensitivity analyses. Heatmaps were generated using the R package heatmap3 (Zhao et al. 2014). Input data for the heatmaps was the rank-transformed data for the subset of associated and identified features, residualised on age and sex. Dendrograms were constructed in R v3.5 or later (R Core Team 2019) ('stats' package) based on (Euclidean) distance matrices and subsequent clustering based on Ward's method (Ward 1963), the Ward2 algorithm (Murtagh and Legendre 2014). All other plots were created in ggplot2 (Wickham 2016).

ALSPAC Cohort description Description of study numbers
ALSPAC recruited 14,541 pregnant women resident in Avon, UK with expected dates of delivery 1st April 1991 to 31st December 1992. 14,541 is the initial number of pregnancies for which the mother enrolled in the ALSPAC study and had either returned at least one questionnaire or attended a "Children in Focus" clinic by 19/07/99. Of these initial pregnancies, there was a total of 14,676 foetuses, resulting in 14,062 live births and 13,988 children who were alive at 1 year of age.
When the oldest children were approximately 7 years of age, an attempt was made to bolster the initial sample with eligible cases who had failed to join the study originally. As a result, when considering variables collected from the age of seven onwards (and potentially abstracted from obstetric notes) there are data available for more than the 14,541 pregnancies mentioned above.
The number of new pregnancies not in the initial sample (known as Phase I enrolment) that are currently represented on the built files and reflecting enrolment status at the age of 18 is 706 (452 and 254 recruited during Phases II and III respectively), resulting in an additional 713 children being enrolled. The phases of enrolment are described in more detail in the cohort profile paper.
The total sample size for analyses using any data collected after the age of seven is therefore 15,454 pregnancies, resulting in 15,589 foetuses. Of this total sample of 15,656 foetuses, 14,973 were live births and 14,899 were alive at 1 year of age.
A 10% sample of the ALSPAC cohort, known as the Children in Focus (CiF) group, attended clinics at the University of Bristol at various time intervals between 4 to 61 months of age. The CiF group were chosen at random from the last 6 months of ALSPAC births (1432 families attended at least one clinic). Excluded were those mothers who had moved out of the area or were lost to follow-up, and those partaking in another study of infant development in Avon.

Data management
Study data were collected and managed using REDCap electronic data capture tools hosted at University of Bristol (Harris et al. 2009). REDCap (Research Electronic Data Capture) is a secure, web-based application designed to support data capture for research studies, providing 1) an intuitive interface for validated data entry; 2) audit trails for tracking data manipulation and export procedures; 3) automated export procedures for seamless data downloads to common statistical packages; and 4) procedures for importing data from external sources.

Details of ethics approvals
Consent for biological samples has been collected in accordance with the Human Tissue Act (2004). Informed consent for the use of data collected via questionnaires and clinics was obtained from participants following the recommendations of the ALSPAC Ethics and Law Committee at the time. Details of ethics approvals relevant to this study:    M.Class = metabolite class (see plot for colour key); GPL = glycerophospholipid; G.group = genotype group (red = young participants, non-carriers; dark red = mothers, non-carriers; green = young participants, carriers of the 'A' allele; dark green = mothers, carriers of the 'A' allele
A series of population-based cohort studies including several meta-analyses have demonstrated that high plasma TAG levels are associated with increased risk of atherosclerotic CVD (Hokanson and Austin 1996;Sarwar et al. 2007). Lipid profiling with respect to cardiovascular events and related phenotypes have shown distinct patterns emerging within TAG subspecies such that the number of carbon atoms and double bonds appear to be relevant. Both a prospective study of CVD (Stegemann et al. 2014) and a randomized dietary intervention trial (Toledo et al. 2017), have shown TAGs to be differentially associated with CVD such that lipids with a lower number of carbon atoms in the acyl chain or with fewer double bonds were associated with a higher risk and those with a higher number of carbon atoms or more double bonds were associated with lower risk. Similar observations have been made with respect to body mass index (Ho et al. 2016) and diabetes prediction (Rhee et al. 2011).

Acyl-acyl GPL
39 associated with 38 of these showing a decrease in concentration in the rs138326449(APOC3) carriers.
Component of the hydrophilic surface of VLDL. The more common class of GPLs in which the hydrocarbon chain at the sn-1 position of the glycerol backbone is attached by an ester bond.
Properties depend on the specific molecule but there is some evidence that GPLs are relevant to disease. For example, diacyl-phosphatidylcholines have been associated with obesity (Bagheri et al. 2018(Bagheri et al. , 2019 and mortality (Sigruener et al. 2014a).

Relevant biology Relevance to disease
Acyl-alkyl GPL 10 associated with all of these showing an increase in concentration in the rs138326449(APOC3) carriers.
Component of the hydrophilic surface of VLDL. Peroxisome-derived GPLs that feature an ether bond in place of the ester bond seen in acyl-acyl GPLs and have multiple functions including incorporation into membranes and changes in membrane organisation, fluidity and stability, as endogenous antioxidants and in cell signalling pathways (Dean and Lodhi 2018).

DAG
14 associated with 10 of these showing a decrease in concentration in the rs138326449(APOC3) carriers.
DAGs are a metabolic product of TAG lipolysis by lipoprotein lipases. However, DAGs can be synthesised from other sources including de novo synthesis from fatty acids and GPL lysis and DAGs are precursors for GPL synthesis.
Increased concentrations of DAGs have previously been associated with increased risk of CVD (Toledo et al. 2017).
DAGs have also been proposed as mediators in the development of insulin resistance associated with obesity and type 2 diabetes, with APOC3 function highlighted as a potential mechanism for regulating intracellular DAG accumulation in muscle and liver via its inhibition of LPL activity (Erion and Shulman 2010). This hypothesis is in line with the previously documented effect of APOC3 variants on insulin resistance (Petersen et al. 2010;Waterworth et al. 2003).