Abstract
Nuclear Magnetic Resonance (NMR) Spectroscopy is the second most used technique (after X-ray crystallography) for structural determination of proteins. A computational challenge in this technique involves solving a discrete optimization problem that assigns the resonance frequency to each atom in the protein. This paper introduces LIAN (LInear programming Assignment for NMR), a novel linear programming formulation of the problem which yields state-of-the-art results in simulated and experimental datasets.
References
Ahuja, R.K., Magnanti, T.L., Orlin, J.B.: Network Flows: Theory, Algorithms, and Applications. Prentice-Hall Inc, Upper Saddle River (1993)
Alipanahi, B., Gao, X., Karakoc, E., Li, S.C., Balbach, F., Feng, G., Donaldson, L., Li, M.: Error tolerant NMR backbone resonance assignment and automated structure generation. J. Bioinform. Comput. Biol. 9(1), 15–41 (2011)
Allain, F., Mareuil, F., Ménager, H., Nilges, M., Bardiaux, B.: ARIAweb: a server for automated NMR structure calculation. Nucleic Acids Research 48(W1), W41–W47 (2020). https://doi.org/10.1093/nar/gkaa362
Bahrami, A., Assadi, A.H., Markley, J.L., Eghbalnia, H.R.: Probabilistic interaction network of evidence algorithm and its application to complete labeling of peak lists from protein NMR spectroscopy. PLOS Comput. Biol. 5(3), 1–15 (2009). https://doi.org/10.1371/journal.pcbi.1000307
Bailey-Kellogg, C., Chainraj, S., Pandurangan, G.: A random graph approach to NMR sequential assignment. J. Comput. Biol. 12(6), 569–583 (2005)
Bang-Jensen, J., Gutin, G.Z.: Digraphs: Theory, Algorithms and Applications. Springer, London (2008)
Baran, M.C., Huang, Y.J., Moseley, H.N.B., Montelione, G.T.: Automated analysis of protein NMR assignments and structures. Chem. Rev. 104(8), 3541–3556 (2004). https://doi.org/10.1021/cr030408p. PMID: 15303826
Bartels, C., Güntert, P., Billeter, M., Wüthrich, K.: GARANT: a general algorithm for resonance assignment of multidimensional nuclear magnetic resonance spectra. J. Comput. Chem. 18(1), 139–149 (1997)
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E.: The protein data bank. Nucleic Acids Res. 28(1), 235–242 (2000)
Bodenhausen, G., Ruben, J.D.: Natural abundance nitrogen-15 NMR by enhanced heteronuclear spectroscopy. Chem. Phys. Lett. 69, 185–189 (1980)
Bromiley, P.: Products and convolutions of Gaussian probability density functions. Tina-Vision Memo 3(4), 1 (2003)
Cavanagh, J., Fairbrother, W.J., Palmer, A.G., Rance, M., Skelton, N.J.: Protein NMR Spectroscopy, 1st edn. Academic Press Limited, London (1996)
Coggins, B.E., Zhou, P.: PACES: Protein sequential assignment by computer-assisted exhaustive search. J. Biomol. NMR 26(2), 93–111 (2003)
Donald, B.R.: Algorithms in Structural Molecular Biology. The MIT Press, Cambridge (2011)
Donald, B.R., Martin, J.: Automated NMR assignment and protein structure determination using sparse dipolar coupling constraints. Prog. Nuclear Magn. Resonance Spectrosc. 55(2), 101–127 (2009). https://doi.org/10.1016/j.pnmrs.2008.12.001
Ferreira, J.F.S.B., Khoo, Y., Singer, A.: Semidefinite programming approach for the quadratic assignment problem with a sparse graph. Comput. Optim. Appl. 69(3), 677–712 (2018). https://doi.org/10.1007/s10589-017-9968-8
Grzesiek, S., Bax, A.: Correlating backbone amide and side chain resonances in larger proteins by multiple relayed triple resonance NMR. J. Am. Chem. Soc. 114(16), 6291–6293 (1992). https://doi.org/10.1021/ja00042a003
Grzesiek, S., Bax, A.: An efficient experiment for sequential backbone assignment of medium-sized isotopically enriched proteins. J. Magn. Resonance 99(1), 201–207 (1992). https://doi.org/10.1016/0022-2364(92)90169-8
Grzesiek, S., Bax, A.: Amino acid type determination in the sequential assignment procedure of uniformly 13C/15N-enriched proteins. J. Biomol. NMR 3(2), 185–204 (1993)
Guerry, P., Herrmann, T.: Comprehensive Automation for NMR Structure Determination of Proteins, pp. 429–451. Humana Press, Totowa (2012). https://doi.org/10.1007/978-1-61779-480-3_22
Güntert, P., Buchner, L.: Combined automated NOE assignment and structure calculation with CYANA. J. Biomol. NMR 62(4), 453–471 (2015). https://doi.org/10.1007/s10858-015-9924-9
Güntert, P., Salzmann, M., Braun, D., Wüthrich, K.: Sequence-specific NMR assignment of proteins by global fragment mapping with the program mapper. J. Biomol. NMR 18(2), 129–137 (2000). https://doi.org/10.1023/A:1008318805889
Gurobi Optimization, LLC: Gurobi optimizer reference manual (2020). http://www.gurobi.com
Hitchens, T.K., Lukin, J.A., Zhan, Y., McCallum, S.A., Rule, G.S.: MONTE: An automated Monte Carlo based approach to nuclear magnetic resonance assignment of proteins. J. Biomol. NMR 25(1), 1–9 (2003)
Jung, Y.S., Zweckstetter, M.: Mars—robust automatic backbone assignment of proteins. J. Biomol. NMR 30(1), 11–23 (2004). https://doi.org/10.1023/B:JNMR.0000042954.99056.ad
Karjalainen, M., Tossavainen, H., Hellman, M., Permi, P.: HACANCOi: a new H-detected experiment for backbone resonance assignment of intrinsically disordered proteins. J. Biomol. NMR 74, 741 (2020)
Leutner, M., Gschwind, R.M., Liermann, J., Schwarz, C., Gemmecker, G., Kessler, H.: Automated backbone assignment of labeled proteins using the threshold accepting algorithm. J. Biomol. NMR 11(1), 31–43 (1998)
Lian, L.Y., Barsukov, I.L.: Resonance Assignments, chap. 3, pp. 55–82. Wiley-Blackwell, Hoboken (2011). https://doi.org/10.1002/9781119972006.ch3
Schmidt, E., Güntert, P.: A new algorithm for reliable and general NMR resonance assignment. J. Am. Chem. Soc. 134(30), 12817–12829 (2012). https://doi.org/10.1021/ja305091n. PMID: 22794163
Ulrich, E.L., Akutsu, H., Doreleijers, J.F., Harano, Y., Ioannidis, Y.E., Lin, J., Livny, M., Mading, S., Maziuk, D., Miller, Z., Nakatani, E., Schulte, C.F., Tolmie, D.E., Kent Wenger, R., Yao, H., Markley, J.L.: Biomagresbank. Nucleic Acids Res. 36(suppl 1), D402–D408 (2008). https://doi.org/10.1093/nar/gkm957
Volk, J., Herrmann, T., Wüthrich, K.: Automated sequence-specific protein NMR assignment using the memetic algorithm MATCH. J. Biomol. NMR 41(3), 127–138 (2008)
Wan, X., Lin, G.: CISA: Combined NMR resonance connectivity information determination and sequential assignment. IEEE/ACM Trans. Comput. Biol. Bioinform. 4(3), 336–348 (2007). https://doi.org/10.1109/tcbb.2007.1047
Yang, Y., Fritzsching, K.J., Hong, M.: Resonance assignment of the NMR spectra of disordered proteins using a multi-objective non-dominated sorting genetic algorithm. J. Biomol. NMR 57(3), 281–296 (2013)
Zeng, J., Zhou, P., Donald, B.R.: HASH: a program to accurately predict protein H\(\alpha \) shifts from neighboring backbone shifts. J. Biomol. NMR 55(1), 105–118 (2013)
Zimmerman, D.E., Kulikowski, C.A., Huang, Y., Feng, W., Tashiro, M., Shimotakahara, S., Ya Chien, C., Powers, R., Montelione, G.T.: Automated analysis of protein NMR assignments using methods from artificial intelligence. J. Mol. Biol. 269(4), 592–610 (1997). https://doi.org/10.1006/jmbi.1997.1052
Acknowledgements
A.S. was partially supported by NSF BIGDATA award IIS-1837992, NIH/NIGMS award 1R01GM136780-01, award FA9550-17-1-0291 from AFOSR, the Simons Foundation Math+X Investigator Award, and the Moore Foundation Data-Driven Discovery Investigator Award. D.C. was supported by NIH GM-117212.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Data and code availability
Data and preliminary (non-production) code used in simulations and tests are available in the authors' repository at https://github.com/fsbravo/lipras.
Appendices
A: Grouping peaks
As mentioned in Sect. 3.1.1, grouping consistent peaks together is a crucial step in creating the graph \({\mathcal {G}}=({\mathcal {V}},{\mathcal {E}})\). Ideally, the enumeration of valid assignments should be as thorough as possible. We can enumerate peak groupings efficiently, and thereby construct the nodes of \({\mathcal {G}}\), by matching measured and expected peaks in a self-consistent way. In particular, we expect a specific set of peaks due to N–H\({}^{N}\) from residue k (see Fig. 10 for a standard example with three experiments) whose values in \({\mathbb {R}}^3\) are consistent along certain dimensions. If there are n residues, we should have n sets of such expected peaks. Therefore, each layer in \({\mathcal {G}}=({\mathcal {V}},{\mathcal {E}})\) should in principle have n nodes, although in practice there are more nodes due to ambiguities.
The notion of consistency significantly simplifies the enumeration, which would otherwise produce an exponential number of nodes. To enumerate consistent peak groupings efficiently, we proceed as follows. Let \(\mathcal {S}_1, \ldots , \mathcal {S}_{L}\) be the measured peak lists corresponding to different heteronuclear experiments, with \(\cup _{l=1}^L\mathcal {S}_l:=[p_1, \ldots , p_{m_2}]\). In the case of Fig. 10, \(L=3\), as we have peaks from three experiments. From these \(m_2\) experimental peaks we form all combinations of seven peaks, each consisting of one peak from \({\mathcal {S}}_1\), two peaks from \({\mathcal {S}}_2\), and four peaks from \({\mathcal {S}}_3\), that satisfy the following criteria.
- For any pair of peaks \(p_u, p_v\) in a combination of seven peaks,
$$\begin{aligned} \vert p_u(1)-p_v(1)\vert&\le \delta _1 \\ \vert p_u(2)-p_v(2)\vert&\le \delta _2. \end{aligned}$$
This means that the frequencies of the seven peaks in the N–H\({}^{N}\) dimensions have to coincide up to tolerances \(\delta _1\) and \(\delta _2\).
- Furthermore, for a combination of seven peaks, let \(p_u, p_v\) be the two peaks in \({\mathcal {S}}_2\). Along the \(\text {C}\) dimension, these peaks should coincide with two of the peaks in \({\mathcal {S}}_3\) (denoted \(p_i,p_j\)) up to tolerance \(\delta _3\), i.e.
$$\begin{aligned} \vert p_u(3)-p_i(3)\vert&\le \delta _3 \\ \vert p_v(3)-p_j(3)\vert&\le \delta _3. \end{aligned}$$
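The two criteria above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `group_peaks`, the representation of a peak as a tuple of three frequencies (indices 0 and 1 for the N–H\({}^{N}\) dimensions, index 2 for the C dimension), and the pre-filtering around the \({\mathcal {S}}_1\) peak are all assumptions made for illustration.

```python
from itertools import combinations, permutations


def nh_close(p, q, d1, d2):
    """First criterion for one pair: agreement in the two N-H dimensions."""
    return abs(p[0] - q[0]) <= d1 and abs(p[1] - q[1]) <= d2


def group_peaks(s1, s2, s3, d1, d2, d3):
    """Enumerate groupings of one S1 peak, two S2 peaks and four S3 peaks.

    Peaks are (N, H, C) tuples; the third coordinate of S1 peaks is ignored.
    This is a hypothetical brute-force sketch, not the paper's code.
    """
    groups = []
    for root in s1:
        # Pre-filter candidates near the S1 peak to prune the search.
        c2 = [p for p in s2 if nh_close(root, p, d1, d2)]
        c3 = [p for p in s3 if nh_close(root, p, d1, d2)]
        for pair in combinations(c2, 2):
            for quad in combinations(c3, 4):
                members = (root,) + pair + quad
                # Criterion 1: every pair of the seven peaks agrees in N-H.
                if not all(nh_close(p, q, d1, d2)
                           for p, q in combinations(members, 2)):
                    continue
                # Criterion 2: the two S2 peaks match two distinct S3 peaks
                # along the C dimension.
                if any(abs(pair[0][2] - qi[2]) <= d3
                       and abs(pair[1][2] - qj[2]) <= d3
                       for qi, qj in permutations(quad, 2)):
                    groups.append(members)
    return groups
```

Each returned tuple of seven peaks would correspond to one candidate node. Filtering on the \({\mathcal {S}}_1\) peak first keeps the number of combinations examined small when the tolerances are tight.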
B: Atom cost
Recall that we defined the cost of an atom, a, under a given set of assigned observations, \(\{x_l\}_{l=1}^{o_a}\) as
Definition 3
(Atom cost) The cost associated with atom a, with a normally distributed prior \(\mathcal {N}(\mu _a, \sigma _a)\), and \(o_a\) observations \(\{x_l^a\}_{l=1}^{o_a}\) defined by the peak grouping, also assumed to be normally distributed around the true frequency, \(\mu \), according to \(\mathcal {N}(\mu , \sigma _l)\) is defined as
where \(f(\cdot \mid u, v)\) is the Gaussian density with mean u and standard deviation v.
This is Definition 1 in the main text. Note that the term inside the expectation is a product of \(o_a\) univariate Gaussian probability density functions. Furthermore, expanding the expectation, we note that
by symmetry. Using a standard result regarding the product of univariate Gaussian PDFs (see, e.g., [11]), we can write
where
We see that this choice of cost function is therefore computationally advantageous, as the desired expectation is a simple function of the observations, \(\{x_l\}_{l=1}^{o_a}\) and of the distributional parameters of the prior, \((\mu _a, \sigma _a)\) and experiments, \(\{\sigma _l\}_{l=1}^{o_a}\). That said, it is certainly not the only cost function that one could use. As an example, we could instead solve a maximum likelihood problem for each peak grouping that would assign the highest likelihood frequency to each atom, given the prior and the observations. The exploration of alternative cost functions is left for future work.
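As a hedged illustration of why this expectation is cheap to evaluate, the Python sketch below computes \(\mathbb {E}_{\mu \sim \mathcal {N}(\mu _a,\sigma _a)}\big [\prod _{l} f(x_l\mid \mu ,\sigma _l)\big ]\) in closed form by repeatedly applying the product rule for two univariate Gaussian densities, and compares it with a direct numerical integration. The function names and the numerical values are assumptions; how the cost is derived from this expectation (for example, through a negative logarithm) follows Definition 1 in the main text and is not reproduced here.

```python
import math


def gauss_pdf(x, mean, sd):
    """Univariate Gaussian density with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def expected_likelihood(obs, obs_sd, prior_mean, prior_sd):
    """Closed-form expectation of prod_l f(x_l | mu, sd_l) over the prior on mu."""
    mean, var = prior_mean, prior_sd ** 2
    scale = 1.0
    for x, sd in zip(obs, obs_sd):
        # By symmetry f(x | mu, sd) = f(mu | x, sd), so each observation is a
        # Gaussian in mu. Multiplying it into the running Gaussian N(mean, var)
        # yields a scaled Gaussian; the scale factor is itself a Gaussian density.
        scale *= gauss_pdf(x, mean, math.sqrt(var + sd ** 2))
        new_var = 1.0 / (1.0 / var + 1.0 / sd ** 2)
        mean = new_var * (mean / var + x / sd ** 2)
        var = new_var
    # The remaining normalized Gaussian integrates to one over mu.
    return scale


def expected_likelihood_numeric(obs, obs_sd, prior_mean, prior_sd,
                                n=20001, width=12.0):
    """Trapezoidal integration of the same expectation, for checking only."""
    lo = prior_mean - width * prior_sd
    hi = prior_mean + width * prior_sd
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        mu = lo + i * h
        val = gauss_pdf(mu, prior_mean, prior_sd)
        for x, sd in zip(obs, obs_sd):
            val *= gauss_pdf(x, mu, sd)
        total += val * (0.5 if i in (0, n - 1) else 1.0)
    return total * h


obs, obs_sd = [8.10, 8.15, 7.95], [0.05, 0.05, 0.10]  # illustrative values
closed = expected_likelihood(obs, obs_sd, prior_mean=8.0, prior_sd=0.3)
numeric = expected_likelihood_numeric(obs, obs_sd, prior_mean=8.0, prior_sd=0.3)
print(closed, numeric)
```

The closed form needs only \(o_a\) density evaluations per atom, whereas the quadrature evaluates the integrand on a fine grid; the two values should agree to within the quadrature error.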
C: Statistical Typing
Statistical typing takes place during both the node and edge creation steps. In particular, we want to avoid creating nodes and edges that are too unlikely to constitute a valid assignment. We act on this notion by defining a threshold below which we would rather have a null assignment than the assignment induced by the relevant nodes. This threshold also determines the cost of the edges to (and from) the dummy nodes, which are therefore the highest cost edges in the graph.
For all simulations in this paper, we use the following definition:
Definition 4
(Atom cost threshold) The maximum allowable cost associated with atom a, with an expected frequency, \(\mu \), distributed according to the normally distributed prior \(\mathcal {N}(\mu _a, \sigma _a)\), and a total of \(o_a\) expected observations is given by:
where
That is, we define the maximum allowable cost for atom a by setting \(\{x^a_l\}_{l=1}^{o_a}\) in Definition 1 to \(\{w^a_l\}_{l=1}^{o_a}\), which constitute an adversarial realization of the observations. In this realization, the mean of the observations is \(\approx \delta \) standard deviations away from the prior mean, and the observations are split into two clusters, \(2\delta \) experimental standard deviations apart.
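To make the verbal description concrete, the Python sketch below builds one hypothetical set of observations with the two stated properties. The formula used for \(w^a_l\) here (a shift of \(\delta \sigma _a\) from the prior mean, plus or minus \(\delta \sigma _l\) in alternation) is an assumption chosen to match the description; it is not taken from Definition 4, and the function name and numerical values are illustrative.

```python
def adversarial_observations(prior_mean, prior_sd, obs_sd, delta):
    """Hypothetical realization: shifted by delta prior standard deviations
    and split into two clusters 2*delta experimental standard deviations apart.
    """
    shifted = prior_mean + delta * prior_sd
    # Alternate the sign so that observations fall into two clusters.
    return [shifted + (delta * sd if l % 2 == 0 else -delta * sd)
            for l, sd in enumerate(obs_sd)]


w = adversarial_observations(prior_mean=8.0, prior_sd=0.3,
                             obs_sd=[0.05, 0.05, 0.05, 0.05], delta=2.0)
mean_w = sum(w) / len(w)
gap = max(w) - min(w)
print(w, mean_w, gap)
```

With an even number of observations of equal experimental standard deviation, the mean sits exactly \(\delta \) prior standard deviations from \(\mu _a\); with an odd number it is only approximately so, which is consistent with the \(\approx \) in the text.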
Cite this article
Bravo-Ferreira, J.F.S., Cowburn, D., Khoo, Y. et al. NMR assignment through linear programming. J Glob Optim 83, 3–28 (2022). https://doi.org/10.1007/s10898-021-01004-3
DOI: https://doi.org/10.1007/s10898-021-01004-3