# Bayesian estimation of the number of protonation sites for urinary metabolites from NMR spectroscopic data

## Abstract

### Introduction

To aid the development of better algorithms for \(^1\)H NMR data analysis, such as alignment or peak-fitting, it is important to characterise and model chemical shift changes caused by variation in pH. The number of protonation sites, a key parameter in the theoretical relationship between pH and chemical shift, is traditionally estimated from the molecular structure, which is often unknown in untargeted metabolomics applications.

### Objective

We aim to use observed NMR chemical shift titration data to estimate the number of protonation sites for a range of urinary metabolites.

### Methods

A pool of urine from healthy subjects was titrated in the range pH 2–12, standard \(^1\)H NMR spectra were acquired and positions of 51 peaks (corresponding to 32 identified metabolites) were recorded. A theoretical model of chemical shift was fit to the data using a Bayesian statistical framework, using model selection procedures in a Markov Chain Monte Carlo algorithm to estimate the number of protonation sites for each molecule.

### Results

The estimated number of protonation sites was found to be correct for 41 out of 51 peaks. In some cases, the number of sites was incorrectly estimated, due to very close pKa values or a limited amount of data in the required pH range.

### Conclusions

Given appropriate data, it is possible to estimate the number of protonation sites for many metabolites typically observed in \(^1\)H NMR metabolomics without knowledge of the molecular structure. This approach may be a valuable resource for the development of future automated metabolite alignment, annotation and peak fitting algorithms.

## Keywords

NMR; pH; Peak shift changes; Protonation site; Bayesian model selection

## 1 Introduction

\(^1\)H NMR is an important technique in metabolomics as it provides highly reproducible, quantitative information on a wide variety of metabolites. The chemical shift and multiplicity pattern are characteristic of the metabolite’s chemical structure, but are complicated by small sample-to-sample changes in the position of individual resonances due to changes in pH, ionic strength or other physical parameters of the matrix (Fan 1996). While these can be ameliorated to some degree by careful analytical procedures, such as addition of buffers and control of physical conditions, changes in chemical shifts are still present in most NMR metabolomic data sets. Computational approaches to correct these changes, such as alignment, can introduce artefacts and are not able to correct shift changes which swap the ordering of resonances (Vu and Laukens 2013). Chemical shift changes can become a major problem in the statistical analysis of NMR metabolomics data, as they disrupt the linear relationship between NMR intensity at a given position and metabolite abundance (Ebbels and Cavill 2009). Thus it becomes important to characterise and model chemical shift changes (see e.g. Takis et al. 2017), in part to aid construction of better algorithms for data analysis, such as alignment or peak-fitting. We recently reported titration model parameters such as acid/base limits and pKas for 33 identified metabolites in human urine, as well as titration curves for a further 65 unidentified peaks (Tredwell et al. 2016). A key problem in modelling NMR spectra from untargeted metabolomics is the unknown structure of the molecules giving rise to each resonance, and thus the lack of knowledge of important parameters. In particular, the number of proton binding sites strongly influences the relationship between chemical shift and pH, but has traditionally been hard to infer from titration data alone.
Here, we report the successful development and application of a Bayesian approach to estimating the number of proton binding sites in \(^1\)H NMR metabolomics data, without knowledge of the molecule’s chemical structure.

## 2 Methods

### 2.1 The model

From Eqs. (1, 2), it is evident that the theoretical chemical shift follows a titration curve which describes the position of the resonance over a range of pH. When the number of sites is known, nonlinear fitting can be applied using Eq. (2) to model the titration curve to obtain the pKa values, as well as the acid and base chemical shift limits (Tredwell et al. 2016). However, in many metabolomics applications (for example alignment), the number of protonation sites may not be known, especially for unknowns or molecules of complex structure. Thus it is of interest to consider whether the number of protonation sites can be estimated along with the pH dependence of the chemical shift.
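As a minimal sketch of such a titration curve, the Python function below assumes the standard fast-exchange form, in which the observed shift is the population-weighted average of the limiting shifts of each protonation state. The function name `titration_shift`, the argument layout and the example numbers are assumptions for illustration; the exact parameterisation used in Eq. (2) may differ.

```python
import numpy as np

def titration_shift(pH, pKa, delta):
    """Chemical shift of a q-site molecule at the given pH values.

    pKa   : q stepwise pKa values (any order; sorted internally).
    delta : q + 1 limiting shifts; delta[0] is the fully deprotonated
            limit (delta_A) and delta[i] the limit with i protons bound.
    """
    pH = np.atleast_1d(np.asarray(pH, dtype=float))
    pKa = np.sort(np.asarray(pKa, dtype=float))
    delta = np.asarray(delta, dtype=float)
    q = pKa.size
    if delta.size != q + 1:
        raise ValueError("need q + 1 limiting shifts for q pKa values")
    # The state with i protons bound involves the i largest pKa values.
    cum_pKa = np.cumsum(pKa[::-1])                # shape (q,)
    n_bound = np.arange(1, q + 1)                 # shape (q,)
    log_w = cum_pKa[:, None] - n_bound[:, None] * pH[None, :]
    # Relative populations; the deprotonated state has weight 1.
    w = np.vstack([np.ones_like(pH), 10.0 ** log_w])
    return (delta[:, None] * w).sum(axis=0) / w.sum(axis=0)

# Hypothetical one-site example: at pH == pKa the shift is the midpoint.
mid = titration_shift(4.76, [4.76], [1.90, 2.10])[0]
```

Far from every pKa the curve flattens onto one of the limiting shifts, which is why observations close to each transition carry most of the information about the number of sites.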

Here, we focus on inferring the number of protonation sites from observations of chemical shift changes for a given resonance at different pH values. Due to their small size, few metabolites have many protonation sites. We therefore limit the search space to 1-site, 2-site and 3-site models, although the approach can be easily extended to include more than 3 protonation sites if required. We employ a Bayesian approach because it provides a natural way of incorporating prior information and combining results of different experiments. In the Bayesian framework, it is, in principle, easy to incorporate model choice in the inferential process by specifying an appropriate prior distribution on the model space. Posterior inference is performed through Markov chain Monte Carlo (MCMC) methods. In this context, as model selection involves models with different dimensions, we employ a reversible jump MCMC algorithm, which is implemented in the software JAGS (Plummer 2003).

### 2.2 Specification of prior knowledge

Since most metabolites have up to three protonation sites, we specify as prior distribution on the number of protonation sites a uniform distribution on the set \(\{1,2,3\}\). Therefore, each model is a priori equally likely. We complete the model by specifying a prior distribution on the remaining parameters. Assuming no additional spectral effects and conditioning on the number of sites *q*, we choose a uniform distribution defined over the NMR ppm scale [0, 10] as prior for \(\delta _A\) and \(\delta _{H_jA}, j=1,\ldots ,q\).
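The priors described above can be illustrated with a short Python sketch that draws one value of *q* and the corresponding limiting shifts. The function name `draw_from_prior` and the seed are assumptions for illustration; this is not the JAGS model itself, only a picture of what the stated priors imply.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily

def draw_from_prior(rng):
    """One draw of (q, limiting shifts) from the stated priors."""
    q = int(rng.integers(1, 4))             # uniform on {1, 2, 3}
    limits = rng.uniform(0.0, 10.0, q + 1)  # delta_A and delta_{H_jA}, in ppm
    return q, limits

q, limits = draw_from_prior(rng)
```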

Moreover, to improve efficiency in searching the parameter space and avoid identifiability issues (where different combinations of parameter values lead to the same likelihood value so that the model is not able to distinguish between them), we impose an order constraint on the \(\mathbf {\delta _A}\) and \(\mathbf {\delta _{H_jA}}\) values, in descending or ascending order according to the trend of the data. This improves MCMC convergence and the accuracy of estimation. The order direction can be estimated, for example, by fitting a simple linear regression, \(\mathbf {y} = \beta \mathbf{pH} + b\), to the data and considering the sign of the estimated slope parameter \(\beta\). If \(\beta > 0\), the relationship between chemical shift and pH is approximately increasing and we would impose the constraint \(\delta _{A}> \delta _{{H_{1} A}}> \delta _{{H_{2} A}}> \cdots> \delta _{{H_{q} A}}\) on the parameter space. On the other hand, if \(\beta < 0\), we would impose the restriction \(\delta _{A} < \delta _{{H_{1} A}} < \delta _{{H_{2} A}} < \cdots < \delta _{{H_{q} A}}\).
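The slope-based choice of ordering can be sketched as follows. The helper names `order_direction` and `order_limits` and the data values are hypothetical; in the actual model the constraint is placed on the prior rather than applied by sorting, so the sort here only illustrates which ordering each slope sign implies.

```python
import numpy as np

def order_direction(pH, shift):
    """Return +1 if shift tends to rise with pH, -1 if it falls."""
    slope, _intercept = np.polyfit(pH, shift, deg=1)
    return 1 if slope > 0 else -1

def order_limits(limits, direction):
    """Arrange [delta_A, delta_H1A, ..., delta_HqA] to respect the constraint."""
    ordered = np.sort(np.asarray(limits, dtype=float))
    # direction > 0: delta_A is the largest value, so use descending order.
    return ordered[::-1] if direction > 0 else ordered

# Hypothetical data: a resonance moving upfield as pH increases.
pH = np.array([3.0, 4.0, 5.0, 6.0, 7.0])
shift = np.array([2.08, 2.05, 1.97, 1.92, 1.91])
direction = order_direction(pH, shift)
limits = order_limits([2.09, 1.91], direction)
```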

The parameter *a*, which reflects the measurement error, should be chosen carefully according to the experiment. In our model, \(a = 10^4\) is chosen based on empirical estimation of the measurement error related to the resolution of the spectrometer and its ability to measure peak position (Karakach et al. 2009).

We fit the model to each resonance independently. We pick as an estimate of *q* the number of protonation sites with highest posterior probability. We then refit the same model but fix *q* equal to its posterior estimate to obtain an estimate of the other parameters conditional on *q*. Posterior inference is performed in JAGS, running four chains of the MCMC algorithm for 50,000 iterations with a burn-in period of 25,000.
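Choosing *q* by highest posterior probability amounts to tabulating the sampled values of *q* after burn-in. The sketch below assumes the draws are available as a plain array; the function name `estimate_sites` and the example draws are hypothetical and not output from the authors' JAGS runs.

```python
import numpy as np

def estimate_sites(q_samples, candidates=(1, 2, 3)):
    """Posterior probabilities of each q and the most probable q."""
    q_samples = np.asarray(q_samples)
    probs = {q: float(np.mean(q_samples == q)) for q in candidates}
    best = max(probs, key=probs.get)
    return probs, best

# Hypothetical post-burn-in draws of q pooled across chains.
q_draws = np.array([2] * 750 + [3] * 250)
probs, q_hat = estimate_sites(q_draws)
```

The remaining parameters would then be re-estimated with *q* fixed at `q_hat`, as described above.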

### 2.3 Prior specification for pKa

A great advantage of working in a Bayesian framework is the ability of the model to incorporate problem-specific prior information. To assign informative prior knowledge on the pKa range, which aids computational stability and improves convergence of the MCMC algorithm, we exploit information available in the Human Metabolome Database (version 4.0), which records the pKa values of many common urine metabolites.

By studying the empirical distribution of the pKa values downloaded from the database, we found that the distribution of pKa values has a heavy right tail. We choose as prior range for pKa [1.2, 13.7] to correspond to the pH range of our data. This range includes most urine metabolites reported in HMDB, but excludes values below the 7th percentile and above the 90th percentile of the pKa distribution.
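A percentile-based range of this kind can be computed directly from a list of pKa values. The sketch below uses synthetic, right-skewed numbers in place of the database download, so the resulting bounds are not those reported above; the function name `prior_range` is an assumption.

```python
import numpy as np

def prior_range(pka_values, lower_pct=7.0, upper_pct=90.0):
    """Range that drops values below/above the given percentiles."""
    lo, hi = np.percentile(np.asarray(pka_values, dtype=float),
                           [lower_pct, upper_pct])
    return float(lo), float(hi)

# Synthetic right-skewed values standing in for database pKa values.
rng = np.random.default_rng(1)
fake_pka = rng.gamma(shape=2.0, scale=3.0, size=5000)
lo, hi = prior_range(fake_pka)
```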

### 2.4 Data

Details of sample collection, NMR acquisition and data processing can be found in Tredwell et al. (2016). All data used in this study are publicly available as Supplementary material to the original article under the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/). Briefly, a urine sample was collected from five different individuals and pooled to obtain an average representative human urine sample. To avoid chemical shift effects from metal ions, the urine was treated with Chelex resin to reduce both \(\text {Ca}^{2+}\) and \(\text {Mg}^{2+}\) concentrations without significantly altering metabolite composition. Note that, while this results in non-physiological concentrations of these ions, it is not expected to affect the ability of the model to recover the number of protonation sites. The pool was then titrated to produce 51 samples covering the range \(2<\mathtt {pH}<12\). Spectra were acquired on a Bruker Avance DRX600 NMR spectrometer (Bruker BioSpin, Rheinstetten, Germany), with a \(^1\)H frequency of 600 MHz. A one-dimensional NOESY sequence was used with water suppression, and data were acquired into 64k data points over a spectral width of 12 kHz, with eight dummy scans and 64 scans per sample. Spectra were processed in iNMR 3.4 (Nucleomatica, Molfetta, Italy). Fourier transform of the free-induction decay was applied with a line broadening of 0.5 Hz. Spectra were manually phased and automated first-order baseline correction was applied. Metabolites were assigned using the Chenomx NMR Suite 5.1 (Chenomx, Inc., Edmonton, Alberta, Canada) relative to 4,4-dimethyl-4-silapentane-1-sulfonic acid (DSS) as an internal standard. Metabolite peak positions were obtained using in-house MATLAB scripts. Data for one metabolite (phenylalanine at 7.35 and 7.41 ppm) were discarded as it was found that the peak positions could not be measured accurately due to the high level of peak overlap in this region of the spectra.

## 3 Results and discussion

Our aim is to estimate the number of protonation sites for small molecule metabolites from their observed NMR pH titration curves. From Fig. 1, it is clear that when the number of protonation sites is estimated correctly, the chemical shift changes match the data quite well.

A summary of the results is shown in Table 1. More detailed results for each resonance can be found in Table 2. Of the 51 resonances, the estimated number of sites matches that found in the literature in 41 cases (\(80.4\%\)). Most of the 10 incorrect predictions result from an underestimation of the number of sites compared to the literature value. The literature site numbers are sourced from the *Handbook of Biochemistry and Molecular Biology* (Lundblad and Macdonald 2010). Where this was not possible (hydroxyisobutyrate, hydroxyisovalerate, indoxyl and methyl-2-oxovalerate), the number was determined from an assessment of the molecular structure.

Table 1 Comparison of the literature number of sites and the number estimated by the model

Literature number of sites | Estimated: 1 site | Estimated: 2 sites | Estimated: 3 sites | Total |
---|---|---|---|---|
1 | 25 | 1 | 0 | 26 |
2 | 5 | 9 | 0 | 14 |
3 | 0 | 4 | 7 | 11 |
Total | 30 | 14 | 7 | 51 |
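The headline figures quoted in the text follow from the comparison table by simple arithmetic, as this short Python check illustrates. The diagonal counts (25, 9, 7) are inferred here from the published row and column totals, and the variable names are illustrative.

```python
import numpy as np

# Rows: literature number of sites (1, 2, 3); columns: estimated (1, 2, 3).
# Diagonal entries are inferred from the published row/column totals.
confusion = np.array([[25, 1, 0],
                      [ 5, 9, 0],
                      [ 0, 4, 7]])

total = int(confusion.sum())
correct = int(np.trace(confusion))
under = int(np.tril(confusion, k=-1).sum())  # estimated < literature
over = int(np.triu(confusion, k=1).sum())    # estimated > literature
accuracy = correct / total
```

With these counts, 9 of the 10 errors are underestimates and 1 is an overestimate.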

Table 2 Probability of different numbers of protonation sites, estimated number of protonation sites and literature number of protonation sites for 51 resonances from 32 metabolites in human urine

Metabolite | Database ID | Chemical shift at pH 7.4 (ppm) | 1-site prob. (%) | 2-site prob. (%) | 3-site prob. (%) | Estimated number of sites | Literature number of sites |
---|---|---|---|---|---|---|---|
Citrate | HMDB0000094 | 2.528 | 0 | 75.404 | 24.596 | 2 | 3 |
Creatinine | HMDB0000562 | 3.033 | 87.889 | 11.593 | 0.518 | 1 | 2 |
Creatinine | HMDB0000562 | 4.043 | 94.992 | 4.65 | 0.358 | 1 | 2 |
Imidazole | HMDB0001525 | 8.040 | 0 | 73.582 | 26.418 | 2 | 1 |
Leucine | HMDB0000687 | 0.932 | 83.263 | 14.099 | 2.638 | 1 | 2 |
TTMethylHistidine | HMDB0000001 | 3.788 | 0 | 94.773 | 5.227 | 2 | 3 |
Taurine | HMDB0000251 | 3.412 | 86.863 | 7.137 | 6 | 1 | 2 |
Tryptophan | HMDB0000929 | 7.719 | 87.38 | 11.156 | 1.464 | 1 | 2 |
Tyrosine | HMDB0000158 | 6.885 | 0 | 90.202 | 9.798 | 2 | 3 |
Tyrosine | HMDB0000158 | 7.207 | 0 | 90.592 | 9.408 | 2 | 3 |

(Rows are populated only for the 10 resonances whose estimated and literature site numbers differ; the rows for the remaining 41 resonances are blank.)

Once the number of protonation sites has been estimated, the other parameters of the model (acid limits, base limits and pKa values) can be estimated using the same model. The modelled pKa values closely agree with the literature values (Lundblad and Macdonald 2010), and the modelled acid and base limits are also in good agreement with the previously modelled values (Tredwell et al. 2016). Therefore we do not present these in detail here. Four examples covering 1, 2 and 3 protonation sites (acetate, alanine, threonine and TTMethylHistidine) are shown in Table 3 and Fig. 2.

Table 3 Literature and modelled results of acetate, alanine, threonine and TTMethylHistidine

Metabolite | Literature pKa values | Modelled pKa values | Modelled acid and base limits (ppm) |
---|---|---|---|
Acetate | 4.760 | 4.591 | 1.910, 2.089 |
Alanine | 2.340, 9.690 | 2.384, 9.980 | 1.212, 1.472, 1.573 |
Threonine | 2.630, 10.430 | 2.072, 9.195 | 1.194, 1.322, 1.379 |
TTMethylHistidine | 1.690, 6.480, 8.850 | 1.832, 6.062, 9.302 | 6.910, 7.040, 7.390, 7.491 |

### 3.1 Metabolites with incorrectly estimated number of protonation sites

The model failed to estimate the correct number of protonation sites for 10 out of 51 resonances. Several types of problem lead to incorrect estimation of the number of protonation sites. The first type occurs when at least one literature pKa value lies outside the range of the observed data. Taurine is a good example of this, as shown in Fig. 3, where it can be seen that one pKa lies at pH 1.5, while the data only cover the pH range 3.2–12.

The second type of inaccurate estimation happens when two adjacent pKas are so close that the change in chemical shift between them is too small compared to the measurement error. The \(\delta\) 2.7 resonance of citrate is a good example of this, as in Fig. 3, where the smooth titration curve around pH 4–5 does not suggest the presence of the third pKa at 4.75. A third type of incorrect estimate happens when the change of chemical shift is so small that the transition cannot be detected near the pKa value, for instance creatinine as shown in Fig. 3. Conversely, the change in the chemical shift can be too large compared to the estimated measurement error, for example imidazole as shown in Fig. 3, forming a fourth type of inaccuracy.

Some molecules have multiple resonances and so the question arises of whether to combine them, or if not, how to pick the best resonance to model. We do not recommend combining resonances from the same molecule as, with our data, this tended to overestimate the number of protonation sites, leading to a poorer fit. Instead, it is preferable to pick a resonance with “good behaviour”, i.e. one which is not overlapped, shows strong changes in chemical shift, and has a good number of observations near each chemical shift transition (near the pKa). When more than one resonance from the same molecule is modelled and the resonances give different predictions for the number of sites, we recommend using information such as the model fit error to judge which estimate is more reliable. We note that this does not apply in fully untargeted analysis when the metabolites are unidentified, and thus one does not know if two resonances come from the same molecule.

## 4 Conclusions

The Bayesian fit based on the model of Szakács et al. (2004) can effectively estimate the number of protonation sites for many small molecule metabolites, given sufficient pH titration data. Incorrect estimations are mainly due to cases where pKa values are very similar, and thus could not be distinguished, and/or a lack of data in the necessary pH ranges. We note that, even when the number of sites was incorrectly estimated, it is still possible to estimate the chemical shift position of a resonance quite accurately in most cases. The information obtained from the modelling procedure described here could be useful in a number of ways. For example, the pH could be estimated from the positions of a few well-known and easily located resonances. This could then be used to predict the chemical shift positions of resonances of other metabolites expected in a sample, which could then help with automated annotation, alignment or peak fitting (as an initial position estimate). The predicted number of protonation sites may also be helpful during the process of identifying unknown compounds, although orthogonal analytical information would almost always be needed in addition. Overall, we hope that this modelling approach may be valuable for the future development of algorithms for analysis of metabolomic \(^1\)H NMR spectra including alignment, annotation and peak fitting.
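The idea of estimating pH from a resonance position can be illustrated for the simplest case, a one-site resonance, by inverting the population-weighted titration curve. This is a sketch under stated assumptions: the function names and the numerical values are illustrative (loosely based on an acetate-like resonance, with the assignment of limits to the protonated and deprotonated forms assumed), and it is not a procedure described by the authors.

```python
import math

def shift_one_site(pH, pKa, delta_A, delta_HA):
    """Forward one-site model: population-weighted average shift."""
    x = 10.0 ** (pKa - pH)  # ratio [HA]/[A]
    return (delta_A + delta_HA * x) / (1.0 + x)

def pH_from_shift(delta, pKa, delta_A, delta_HA):
    """Invert the one-site model; delta must lie strictly between the limits."""
    ratio = (delta - delta_A) / (delta_HA - delta)  # [HA]/[A]
    if ratio <= 0:
        raise ValueError("shift lies outside the acid/base limits")
    return pKa - math.log10(ratio)

# Illustrative values only (which limit is acid or base is an assumption).
pKa, d_A, d_HA = 4.59, 1.91, 2.09
observed = shift_one_site(5.0, pKa, d_A, d_HA)
estimated_pH = pH_from_shift(observed, pKa, d_A, d_HA)
```

The inversion is only well conditioned near the pKa, where the shift changes quickly with pH; close to either limit, small errors in peak position translate into large errors in the estimated pH.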

## Notes

### Acknowledgements

The authors are grateful to Dr Jake Bundy and Dr Gregory Tredwell for use of the titration series data.

### Author contributions

TMDE and MDI conceived and designed the research. LY performed the research. All authors read and approved the manuscript.

### Funding

The funding was provided by Horizon 2020 Framework Programme (Grant No. EC654241) and China Sponsorship Council.

### Compliance with ethical standards

### Conflict of interest

Lifeng Ye, Maria De Iorio and Timothy M. D. Ebbels declare that they have no conflict of interest.

### Ethical approval

This study analysed previously collected data which involved human participants who had provided informed consent. These ethical issues are described in detail in Tredwell et al. (2016).

### Informed consent

Informed consent was obtained from all individual participants included in the study.

## References

- Ackerman, J. J. H., Soto, G. E., Spees, W. M., Zhu, Z., & Evelhoch, J. L. (1996). The NMR chemical shift pH measurement revisited: Analysis of error and modeling of a pH dependent reference. *Magnetic Resonance in Medicine*, *36*(5), 674–683.
- Ebbels, T., & Cavill, R. (2009). Bioinformatic methods in NMR-based metabolic profiling. *Progress in Nuclear Magnetic Resonance Spectroscopy*, *55*(4), 361–374.
- Fan, T. W. M. (1996). Metabolite profiling by one- and two-dimensional NMR analysis of complex mixtures. *Progress in Nuclear Magnetic Resonance Spectroscopy*, *28*, 161–219.
- HMDB CA. (2017). Human metabolome database. http://www.hmdb.ca. Accessed 10 Oct 2017.
- Karakach, T., Wentzell, P., & Walter, J. (2009). Characterization of the measurement error structure in 1D 1H NMR data for metabolomics studies. *Analytica Chimica Acta*, *636*(2), 163–174.
- Lundblad, R., & Macdonald, F. (2010). *Handbook of biochemistry and molecular biology*. Cleveland, OH: CRC Press.
- Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In *Proceedings of the 3rd international workshop on distributed statistical computing* (Vol. 124).
- Szakács, Z., Hägele, G., & Tyka, R. (2004). 1H/31P NMR pH indicator series to eliminate the glass electrode in NMR spectroscopic pKa determinations. *Analytica Chimica Acta*, *522*(2), 247–258.
- Takis, P. G., Schäfer, H., Spraul, M., & Luchinat, C. (2017). Deconvoluting interrelationships between concentrations and chemical shifts in urine provides a powerful analysis tool. *Nature Communications*, *8*(1), 1662.
- Tredwell, G., Bundy, J., De Iorio, M., & Ebbels, T. (2016). Modelling the acid/base 1H NMR chemical shift limits of metabolites in human urine. *Metabolomics*, *12*(10), 152.
- Vu, T., & Laukens, K. (2013). Getting your peaks in line: A review of alignment methods for NMR spectral data. *Metabolites*, *3*(2), 259–276.
- Wishart, D. S., Jewison, T., Guo, A. C., Wilson, M., Knox, C., et al. (2012). HMDB 3.0—The human metabolome database in 2013. *Nucleic Acids Research*, *41*(D1), D801–D807.
- Wishart, D. S., Knox, C., Guo, A. C., et al. (2009). HMDB: A knowledgebase for the human metabolome. *Nucleic Acids Research*, *37*(Database), D603–D610.
- Wishart, D. S., Tzur, D., Knox, C., et al. (2007). HMDB: The human metabolome database. *Nucleic Acids Research*, *35*(Database), D521–D526.

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.