Bayesian estimation of the number of protonation sites for urinary metabolites from NMR spectroscopic data

To aid the development of better algorithms for 1\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^1$$\end{document}H NMR data analysis, such as alignment or peak-fitting, it is important to characterise and model chemical shift changes caused by variation in pH. The number of protonation sites, a key parameter in the theoretical relationship between pH and chemical shift, is traditionally estimated from the molecular structure, which is often unknown in untargeted metabolomics applications. We aim to use observed NMR chemical shift titration data to estimate the number of protonation sites for a range of urinary metabolites. A pool of urine from healthy subjects was titrated in the range pH 2–12, standard 1\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^1$$\end{document}H NMR spectra were acquired and positions of 51 peaks (corresponding to 32 identified metabolites) were recorded. A theoretical model of chemical shift was fit to the data using a Bayesian statistical framework, using model selection procedures in a Markov Chain Monte Carlo algorithm to estimate the number of protonation sites for each molecule. The estimated number of protonation sites was found to be correct for 41 out of 51 peaks. In some cases, the number of sites was incorrectly estimated, due to very close pKa values or a limited amount of data in the required pH range. Given appropriate data, it is possible to estimate the number of protonation sites for many metabolites typically observed in 1\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^1$$\end{document}H NMR metabolomics without knowledge of the molecular structure. This approach may be a valuable resource for the development of future automated metabolite alignment, annotation and peak fitting algorithms.


Introduction
1 H NMR is an important technique in metabolomics as it provides highly reproducible, quantitative information on a wide variety of metabolites. The chemical shift and multiplicity pattern are characteristic of the metabolite's chemical structure, but are complicated by small sample-to-sample changes in the position of individual resonances due to changes in pH, ionic strength or other physical parameters of the matrix (Fan 1996). While these can be ameliorated to some degree by careful analytical procedures, such as addition of buffers and control of physical conditions, changes in chemical shifts are still present in most NMR metabolomic data sets. Computational approaches to correct these changes, such as alignment, can introduce artefacts and are not able to correct shift changes which swap the ordering of resonances (Vu and Laukens 2013). Chemical shift changes can become a major problem in the statistical analysis of NMR metabolomics data, as they disrupt the linear relationship between NMR intensity at a given position and metabolite abundance (Ebbels and Cavill 2009). Thus it becomes important to characterise and model chemical shift changes (see e.g. Takis et al. 2017), in part to aid construction of better algorithms for data analysis, such as alignment or peakfitting. We recently reported titration model parameters such as acid/base limits and pKas for 33 identified metabolites 1 3 56 Page 2 of 8 in human urine, as well as titration curves for a further 65 unidentified peaks (Tredwell et al. 2016). A key problem in modelling NMR spectra from untargeted metabolomics is the unknown structure of the molecules giving rise to each resonance, and thus the lack of knowledge of important parameters. In particular, the number of proton binding sites strongly influences relationship of chemical shift to pH, but has traditionally been hard to infer from titration data alone. Here, we report the successful development and application of a Bayesian approach to estimating the number of proton binding sites in 1 H NMR metabolomics data, without knowledge of the molecule's chemical structure.

The model
As protonation is usually rapid and reversible on the NMR timescale, the theoretical chemical shift ( ̃ ) is a weighted average of the limiting chemical shifts of the unprotonated ( A ) and the protonated ( HA ) states of the molecule (Ackerman et al. 1996;Szakács et al. 2004). Ackerman et al. (1996) model the theoretical chemical shift as a function of pH and pKa as follows Szakács et al. (2004) extend this approach to molecules with q > 1 protonation sites: accounting for the interaction between protons bound at different binding sites and the statistics of proton binding.
From Eqs. (1, 2), it is evident that the theoretical chemical shift follows a titration curve which describes the position of the resonance over a range of pH. When the number of sites is known, nonlinear fitting can be applied using Eq.
(2) to model the titration curve to obtain the pKa values, as well as the acid and base chemical shift limits (Tredwell et al. 2016). However, in many metabolomics applications (for example alignment), the number of protonation sites may not be known, especially for unknowns or molecules of complex structure. Thus it is of interest to consider whether the number of protonation sites can be estimated along with the pH dependence of the chemical shift.
Here, we focus on inferring the number of protonation sites from observations of chemical shift changes for a given resonance at different pH values. Due to their small size, few (1) = A + HA (10 (pH−pKa) ) 1 + 10 (pH−pKa) metabolites have many protonation sites. We therefore limit the search space to 1-site, 2-site and 3-site models, although the approach can be easily extended to include more than 3 protonation sites if required. We employ a Bayesian approach because it provides a natural way of incorporating prior information and combining results of different experiments. In the Bayesian framework, it is, in principle, easy to incorporate model choice in the inferential process by specifying an appropriate prior distribution on the model space. Posterior inference is performed through Markov chain Monte Carlo (MCMC) methods. In this context, as model selection involves models with different dimensions, we employ a Reversible jump MCMC algorithm, which is implemented in the software JAGS (Plummer and Martyn 2003).
We propose a non-linear Bayesian regression model for each NMR resonance for each molecule. In particular, we assume that the observed chemical shift, y i , follows a normal distribution, with mean ̃ , representing the theoretical chemical shift, and variance 2 , the measurement error: The theoretical chemical shift ̃i is a function of the pH, pKa, A and the number of protonation sites as described in Eq. (2).

Specification of prior knowledge
Since most metabolites have up to three protonation sites, we specify as prior distribution on the number of protonation sites a uniform distribution on the set {1, 2, 3} . Therefore, each model is a priori equally likely. We complete the model by specifying a prior distribution on the remaining parameters. Assuming no additional spectral effects and conditioning on the number of sites q, we choose a uniform distribution defined over the NMR ppm scale [0, 10] as prior for A and H j A , j = 1, … , q.
Moreover, to improve efficiency in searching the parameter space and avoid identifiability issues (where different combinations of parameter values lead to the same likelihood value so that the model is not able to distinguish between them) we impose an order constraint on the and values, in descending or ascending order according to the trend of the data. This improves MCMC convergence and the accuracy of estimation. The order direction can be estimated, for example, by fitting a simple linear regression, = + b , to the data and considering the sign of the estimated slope parameter . If > 0 , the relationship between chemical shift and pH is approximately increasing a n d w e w o u l d i m p o s e t h e c o n s t r a i n t the other hand, if < 0 , we would impose restriction For most metabolites the change in chemical shift between adjacent protonation sites is smaller than 1ppm and the total shift change from most acidic to most basic peak position is also smaller than 1ppm. This allows us to assume that the change of chemical shift between adjacent protonation sites is smaller than 1ppm, i.e.
Finally, an Inverse-Gamma prior distribution with parameters ( a 2 , a ), which is often used as a Bayesian prior for error variance, is chosen for 2 . Note that a, which reflects the measurement error, should be chosen carefully according to the experiment. In our model, a = 10 4 is chosen based on empirical estimation of the measurement error related to the resolution of the spectrometer and its ability to measure peak position (Karakach et al. 2009).
We fit the model to each resonance independently. We pick as an estimate of q the number of protonation sites with highest posterior probability. We then refit the same model but fix q equal to its posterior estimate to obtain an estimate of the other parameters conditional on q. Posterior inference is performed in JAGS, running four chains of the MCMC algorithm for 50,000 iterations with a burn-in period of 25,000.

Prior specification for pKa
A great advantage of working in a Bayesian framework is the ability of the model to incorporate problem specific prior information. To assign informative prior knowledge on the pKa range, which aids computational stability and improves convergence of the MCMC algorithm, we exploit information available in the the Human Metabolome Database (version 4.0), which records the pKa values of many common urine metabolites.
By studying the empirical distribution of the pKa values downloaded from the database, we found that the distribution of pKa values has a heavy right tail. We choose as prior range for pKa [1.2, 13.7] to correspond to the pH range of our data. This range includes most urine metabolites reported in HMDB, but excludes values below the 7% and above the 90% percentile of the pKa distribution.

Data
Details of sample collection, NMR acquisition and data processing can be found in Tredwell et al. (2016). All data used in this study is publically available as Supplementary material to the original article under the Creative Commons attribution 4.0 International License https ://creat iveco mmons .org/licen ses/by/4.0/. Briefly, a urine sample was collected from five different individuals and pooled to obtain an average representative human urine sample. To avoid chemical shift effects from metal ions the urine was treated with chelex resin to reduce both Ca 2+ and Mg 2+ concentrations without significantly altering metabolite composition. Note that, while this results in non-physiological concentrations of these ions, it is not expected to affect the ability of the model to recover the number of protonation sites. The pool was then titrated to produce 51 samples covering the range 2 < < 12 . Spectra were acquired on a Bruker Avance DRX600 NMR spectrometer (Bruker BioSpin, Rheinstetten, Germany), with a 1 H frequency of 600 MHz. A one-dimensional NOESY sequence was used with water suppression, and data were acquired into 64k data points over a spectral width of 12 KHz, with eight dummy scans and 64 scans per sample. Spectra were processed in iNMR 3.4 (Nucleomatica, Molfetta, Italy). Fourier transform of the free-induction decay was applied with a line broadening of 0.5 Hz. Spectra were manually phased and automated first order baseline correction was applied. Metabolites were assigned using the Chenomx NMR Suite 5.1 (Chenomx, Inc., Edmonton, Alberta, Canada) relative to 4,4-dimethyl-4-silapentane-1-sulfonic acid (DSS) as an internal standard. Metabolite peak positions were obtained using in-house MATLAB scripts. Data for one metabolite (phenylalanine at 7.35 and 7.41 ppm) were discarded as it was found that the peak positions could not be measured accurately due to the high level of peak overlap in this region of the spectra.

Results and discussion
Our aim is to estimate the number of protonation sites for small molecule metabolites from their observed NMR pH titration curves. From Fig. 1, it is clear that when the number of protonation sites is estimated correctly, the chemical shift changes match the data quite well.
A summary of the results is shown in Table 1. More detailed results for each resonance can be found in Table 2. Of the 51 resonances, the estimated number of sites matches that found in the literature in 41 cases ( 80.4% ). It is evident that most of the incorrect predictions (10 out of 51) result from an underestimation of the number of sites compared to the literature value. The literature site numbers are sourced from handbook of biochemistry and molecular biology (Lundblad and Macdonald 2010). Where this was not possible, (hydroxyisobutyrate, hydroxyisovalerate, indoxyl and methyl-2-oxovalerate) the number was determined from an assessment of the molecular structure.
Given the estimation of the number of protonation sites, the other parameters of the model (acid limits, base limits and pKa values) can be estimated using the same model.
The modelled pKa values closely agree with the literature values (Lundblad and Macdonald 2010), and the modelled acid and base limits are also in good agreement with the previously modelled values (Tredwell et al. 2016). Therefore we do not present these in detail here. Four examples including 1, 2 and 3 protonation sites, (acetate, alanine, threonine and TTMethylHistidine) are shown in Table 3 and Fig. 2.

Metabolites with incorrectly estimated number of protonation sites
The model failed to estimate the correct number of protonation sites for 10 out of 51 resonances. There are several types of problem leading to incorrect estimation of the number of protonation sites. The first type ocurrs when at least one literature pKa value lies outside the range of the observed data. Taurine is a good example of this, as shown in Fig. 3, where it can be seen that one pKa lies at pH 1.5, while the data only cover the pH range 3.2-12. The second type of inaccurate estimation happens when two adjacent pKas are so close that the change in chemical shift between them is too small compared to the measurement error. The 2.7 resonance of citrate is a good example of this, as in Fig. 3, where the smooth titration curve around pH 4-5 does not suggest the presence of the third pKa at     Fig. 3. Conversely, the change in the chemical shift can be too large compared to the estimated measurement error, for example imidazole as shown in Fig. 3, forming a fourth type of inaccuracy. Some molecules have multiple resonances and so the question arises of whether to combine them, or if not, how to pick the best resonance to model. We do not recommend to combine resonances from the same molecule as, with our data, this tended to over estimate the number of protonation sites leading to a poorer fit. Instead, it is preferred to pick a resonance with "good behaviour", i.e. one which is not overlapped, shows strong changes in chemical shift, but with a good number of observations near each chemical shift transition (near the pKa). When more than one resonance from the same molecule are modelled and give different predictions for the number of sites, we recommend to use information such as the model fit error to judge which estimation is more reliable. We note that this does not apply in fully untargeted analysis when the metabolites are unidentified, and thus one does not know if two resonances come from the same molecule.

Conclusions
The Bayesian fit based on the model of Szakács et al. (2004) can effectively estimate the number of protonation sites for many small molecule metabolites, given sufficient pH titration data. Incorrect estimations are mainly due to cases where pKa values are very similar, and thus could not be distinguished, and/or a lack of data in the necessary pH ranges. We note that, even when the number of sites was incorrectly estimated, it is still possible to estimate the chemical shift position of a resonance quite accurately in most cases. The information obtained from the modelling procedure described here could be useful in a number of ways. For example, the pH could be estimated from the positions of a few well known and easily located resonances. This could then be used to predict the chemical shift positions of resonances of other metabolites expected in a sample, which could then help with automated annotation, alignment or peak fitting (as an initial position estimate). The predicted number of protonation sites may also be helpful during the process of identifying unknown compounds, although orthogonal analytical information would almost always be needed in addition. Overall, we hope that this modelling approach may be valuable for the future development of algorithms for analysis of metabolomic 1 H NMR spectra including alignment, annotation and peak fitting.