Abstract
Predicting three-dimensional residue-residue contacts from evolutionary information in protein sequences was attempted as early as the 1990s. However, the contact prediction accuracies of methods evaluated in CASP experiments before CASP11 remained quite low, typically with <20% true positives. Recently, contact prediction has improved significantly, to the level that an accurate three-dimensional model of a large protein can be generated on the basis of predicted contacts. This improvement was attained by disentangling direct from indirect correlations in amino acid covariations or cosubstitutions between sites in protein evolution. Here, we review statistical methods for extracting causative correlations and various approaches to describing protein structures, complexes, and flexibility based on predicted contacts.
References
Adhikari B, Bhattacharya D, Cao R, Cheng J (2015) CONFOLD: residue-residue contact-guided ab initio protein folding. Proteins 83:1436–1449. https://doi.org/10.1002/prot.24829
Adhikari B, Nowotny J, Bhattacharya D, Hou J, Cheng J (2016) ConEVA: a toolbox for comprehensive assessment of protein contacts. BMC Bioinf 17:517. https://doi.org/10.1186/s12859-016-1404-z
Altschuh D, Vernet T, Berti P, Moras D, Nagai K (1988) Coordinated amino acid changes in homologous protein families. Protein Eng 2:193–199
Anishchenko I, Ovchinnikov S, Kamisetty H, Baker D (2017) Origins of coevolution between residues distant in protein 3D structures. Proc Natl Acad Sci USA 114:9122–9127. https://doi.org/10.1073/pnas.1702664114
Atchley WR, Wollenberg KR, Fitch WM, Terhalle W, Dress AW (2000) Correlations among amino acid sites in bHLH protein domains: an information theoretic analysis. Mol Biol Evol 17:164–178
Balakrishnan S, Kamisetty H, Carbonell JG, Lee SI, Langmead CJ (2011) Learning generative models for protein fold families. Proteins 79:1061–1078. https://doi.org/10.1002/prot.22934
Baldassi C, Zamparo M, Feinauer C, Procaccini A, Zecchina R, Weigt M, Pagnani A (2014) Fast and accurate multivariate Gaussian modeling of protein families: predicting residue contacts and protein-interaction partners. PLoS ONE 9(3):e92721. https://doi.org/10.1371/journal.pone.0092721
Barton JP, Leonardis ED, Coucke A, Cocco S (2016) ACE: adaptive cluster expansion for maximum entropy graphical model inference. Bioinformatics 32:3089–3097. https://doi.org/10.1093/bioinformatics/btw328
Braun W, Go N (1985) Calculation of protein conformations by proton-proton distance constraints: a new efficient algorithm. J Mol Biol 186:611–626. https://doi.org/10.1016/0022-2836(85)90134-2
Brünger AT (2007) Version 1.2 of the crystallography and NMR system. Nat Protoc 2:2728–2733. https://doi.org/10.1038/nprot.2007.406
Burger L, van Nimwegen E (2008) Accurate prediction of protein-protein interactions from sequence alignments using a Bayesian method. Mol Syst Biol 4:165
Burger L, van Nimwegen E (2010) Disentangling direct from indirect co-evolution of residues in protein alignments. PLoS Comput Biol 6(1):e1000633. https://doi.org/10.1371/journal.pcbi.1000633
CASP12 (2017) 12th community wide experiment on the critical assessment of techniques of protein structure prediction. http://predictioncenter.org/casp12/
Cocco S, Monasson R (2011) Adaptive cluster expansion for inferring Boltzmann machines with noisy data. Phys Rev Lett 106:090601. https://doi.org/10.1103/PhysRevLett.106.090601
Cocco S, Monasson R (2012) Adaptive cluster expansion for the inverse Ising problem: convergence, algorithm and tests. J Stat Phys 147:252–314. https://doi.org/10.1007/s10955-012-0463-4
Cocco S, Feinauer C, Figliuzzi M, Monasson R, Weigt M (2017) Inverse statistical physics of protein sequences: a key issues review. arXiv:1703.01222 [q-bio.BM]
Doron-Faigenboim A, Pupko T (2007) A combined empirical and mechanistic codon model. Mol Biol Evol 24:388–397
Dunn SD, Wahl LM, Gloor GB (2008) Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics 24:333–340
Dutheil J (2012) Detecting coevolving positions in a molecule: why and how to account for phylogeny. Brief Bioinf 13:228–243
Dutheil J, Galtier N (2007) Detecting groups of coevolving positions in a molecule: a clustering approach. BMC Evol Biol 7:242
Dutheil J, Pupko T, Jean-Marie A, Galtier N (2005) A model-based approach for detecting coevolving positions in a molecule. Mol Biol Evol 22:1919–1928
Ekeberg M, Lövkvist C, Lan Y, Weigt M, Aurell E (2013) Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Phys Rev E 87:012707–1–16. https://doi.org/10.1103/PhysRevE.87.012707
Ekeberg M, Hartonen T, Aurell E (2014) Fast pseudolikelihood maximization for direct-coupling analysis of protein structure from many homologous amino-acid sequences. J Comput Phys 276:341–356
Fares M, Travers S (2006) A novel method for detecting intramolecular coevolution. Genetics 173:9–23
Fariselli P, Olmea O, Valencia A, Casadio R (2001) Prediction of contact maps with neural networks and correlated mutations. Protein Eng 14:835–843
Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, Salazar GA, Tate J, Bateman A (2016) The Pfam protein families database: towards a more sustainable future. Nucl Acid Res 44:D279–D285. https://doi.org/10.1093/nar/gkv1344
Fitch WM, Markowitz E (1970) An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem Genet 4:579–593
Fleishman SJ, Yifrach O, Ben-Tal N (2004) An evolutionarily conserved network of amino acids mediates gating in voltage-dependent potassium channels. J Mol Biol 340:307–318
Fodor AA, Aldrich RW (2004) Influence of conservation on calculations of amino acid covariance in multiple sequence alignment. Proteins 56:211–221
Giraud BG, Heumann JM, Lapedes AS (1999) Superadditive correlation. Phys Rev E 59:4973–4991
Göbel U, Sander C, Schneider R, Valencia A (1994) Correlated mutations and residue contacts in proteins. Proteins 18:309–317
Gulyás-Kovács A (2012) Integrated analysis of residue coevolution and protein structure in ABC transporters. PLoS ONE 7(5):e36546. https://doi.org/10.1371/journal.pone.0036546
Halabi N, Rivoire O, Leibler S, Ranganathan R (2009) Protein sectors: evolutionary units of three-dimensional structure. Cell 138:774–786
Havel TF, Kuntz ID, Crippen GM (1983) The combinatorial distance geometry method for the calculation of molecular conformation. I. A new approach to an old problem. J Theor Biol 104:359–381
Hopf TA, Colwell LJ, Sheridan R, Rost B, Sander C, Marks DS (2012) Three-dimensional structures of membrane proteins from genomic sequencing. Cell 149:1607–1621. https://doi.org/10.1016/j.cell.2012.04.012
Hopf TA, Schärfe CPI, Rodrigues JPGLM, Green AG, Kohlbacher O, Bonvin AMJJ, Sander C, Marks DS (2014) Sequence co-evolution gives 3D contacts and structures of protein complexes. eLife 3:e03430. https://doi.org/10.7554/eLife.03430
Hopf TA, Ingraham JB, Poelwijk FJ, Schärfe CPI, Springer M, Sander C, Marks DS (2017) Mutation effects predicted from sequence co-variation. Nature Biotech 35:128–135. https://doi.org/10.1038/nbt.3769
Ingraham J, Marks D (2016) Variational inference for sparse and undirected models. arXiv:1602.03807 [stat.ML]
Jacquin H, Gilson A, Shakhnovich E, Cocco S, Monasson R (2016) Benchmarking inverse statistical approaches for protein structure and design with exactly solvable models. PLoS Comput Biol 12:e1004889. https://doi.org/10.1371/journal.pcbi.1004889
Johnson LS, Eddy SR, Portugaly E (2010) Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinf 11:431
Jones DT (2001) Predicting novel protein folds by using FRAGFOLD. Proteins 45(S5):127–132
Jones DT, Bryson K, Coleman A, McGuffin LJ, Sadowski MI, Sodhi JS, Ward JJ (2005) Prediction of novel and analogous folds using fragment assembly and fold recognition. Proteins 61(S7):143–151. https://doi.org/10.1002/prot.20731
Jones DT, Buchan DWA, Cozzetto D, Pontil M (2012) PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics 28:184–190. https://doi.org/10.1093/bioinformatics/btr638
Jones DT, Singh T, Kosciolek T, Tetchner S (2015) MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics 31:999–1006. https://doi.org/10.1093/bioinformatics/btu791
Kaján L, Hopf TA, Kalaš M, Marks DS, Rost B (2014) FreeContact: fast and free software for protein contact prediction from residue co-evolution. BMC Bioinf 15:85
Kamisetty H, Ovchinnikov S, Baker D (2013) Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era. Proc Natl Acad Sci USA 110:15674–15679. https://doi.org/10.1073/pnas.1314045110
Kim DE, Chivian D, Baker D (2004) Protein structure prediction and analysis using the Rosetta server. Nucl Acid Res 32:W526–W531
Kim DE, Blum B, Bradley P, Baker D (2009) Sampling bottlenecks in de novo protein structure prediction. J Mol Biol 393:249–260
Kosciolek T, Jones DT (2014) De novo structure prediction of globular proteins aided by sequence variation-derived contacts. PLoS ONE 9:e92197. https://doi.org/10.1371/journal.pone.0092197
Kosciolek T, Jones DT (2016) Accurate contact predictions using covariation techniques and machine learning. Proteins 84(S1):145–151. https://doi.org/10.1002/prot.24863
Lapedes AS, Giraud BG, Liu LC, Stormo GD (1999) Correlated mutations in protein sequences: phylogenetic and structural effects. In: Seillier-Moiseiwitsch F (ed) IMS lecture notes: statistics in molecular biology and genetics: selected proceedings of the joint AMS-IMS-SIAM summer conference on statistics in molecular biology, 22–26 June 1997, pp 345–352. Institute of Mathematical Statistics
Lapedes A, Giraud B, Jarzynski C (2002) Using sequence alignments to predict protein structure and stability with high accuracy. LANL Science Magazine LA-UR-02-4481
Lapedes A, Giraud B, Jarzynski C (2012) Using sequence alignments to predict protein structure and stability with high accuracy. arXiv:1207.2484 [q-bio.QM]
Maisnier-Patin S, Andersson DI (2004) Adaptation to the deleterious effect of antimicrobial drug resistance mutations by compensatory evolution. Res Microbiol 155:360–369
Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, Zecchina R, Sander C (2011) Protein 3D structure computed from evolutionary sequence variation. PLoS ONE 6(12):e28766. https://doi.org/10.1371/journal.pone.0028766
Marks DS, Hopf TA, Sander C (2012) Protein structure prediction from sequence variation. Nat Biotech 30:1072–1080. https://doi.org/10.1038/nbt.2419
Martin LC, Gloor GB, Dunn SD, Wahl LM (2005) Using information theory to search for co-evolving residues in proteins. Bioinformatics 21:4116–4124
Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953) Equation of state calculations by fast computing machines. J Chem Phys 21:1087–1092
Miyazawa S (2013) Prediction of contact residue pairs based on co-substitution between sites in protein structures. PLoS ONE 8(1):e54252. https://doi.org/10.1371/journal.pone.0054252
Miyazawa S (2017a) Prediction of structures and interactions from genome information. arXiv:1709.08021 [q-bio.BM]
Miyazawa S (2017b) Selection originating from protein stability/foldability: relationships between protein folding free energy, sequence ensemble, and fitness. J Theor Biol 433:21–38. https://doi.org/10.1016/j.jtbi.2017.08.018
Miyazawa S, Jernigan RL (1996) Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term for simulation and threading. J Mol Biol 256:623–644. https://doi.org/10.1006/jmbi.1996.0114
Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, Sander C, Zecchina R, Onuchic JN, Hwa T, Weigt M (2011) Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc Natl Acad Sci USA 108:E1293–E1301. https://doi.org/10.1073/pnas.1111471108
Morcos F, Schafer NP, Cheng RR, Onuchic JN, Wolynes PG (2014) Coevolutionary information, protein folding landscapes, and the thermodynamics of natural selection. Proc Natl Acad Sci USA 111:12408–12413. https://doi.org/10.1073/pnas.1413575111
Moult J, Fidelis K, Kryshtafovych A, Schwede T, Tramontano A (2016) Critical assessment of methods of protein structure prediction: progress and new directions in round XI. Proteins 84(S1):4–14. https://doi.org/10.1002/prot.25064
Nugent T, Jones DT (2012) Accurate de novo structure prediction of large transmembrane protein domains using fragment assembly and correlated mutation analysis. Proc Natl Acad Sci USA 109:E1540–E1547. https://doi.org/10.1073/pnas.1120036109
Ovchinnikov S, Kim DE, Wang RYR, Liu Y, DiMaio F, Baker D (2016) Improved de novo structure prediction in CASP11 by incorporating coevolution information into Rosetta. Proteins 84(S1):67–75. https://doi.org/10.1002/prot.24974
Pazos F, Helmer-Citterich M, Ausiello G, Valencia A (1997) Correlated mutations contain information about protein-protein interaction. J Mol Biol 271:511–523
Pollock DD, Taylor WR (1997) Effectiveness of correlation analysis in identifying protein residues undergoing correlated evolution. Protein Eng 10:647–657
Pollock DD, Taylor WR, Goldman N (1999) Coevolving protein residues: maximum likelihood identification and relationship to structure. J Mol Biol 287:187–198
Poon AFY, Lewis FI, Frost SDW, Kosakovsky Pond SL (2008) Spidermonkey: rapid detection of co-evolving sites using Bayesian graphical models. Bioinformatics 24:1949–1950
Remmert M, Biegert A, Hauser A, Söding J (2012) HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods 9:173–175
Riedmiller M, Braun H (1993) A direct adaptive method for faster backpropagation learning: the RPROP algorithm. IEEE Int Conf Neural Netw 1993:586–591
Russ WP, Lowery DM, Mishra P, Yaffe MB, Ranganathan R (2005) Natural-like function in artificial WW domains. Nature 437:579–583
Seemayer S, Gruber M, Söding J (2014) CCMpred-fast and precise prediction of protein residue-residue contacts from correlated mutations. Bioinformatics 30:3128–3130. https://doi.org/10.1093/bioinformatics/btu500
Sfriso P, Duran-Frigola M, Mosca R, Emperador A, Aloy P, Orozco M (2016) Residues coevolution guides the systematic identification of alternative functional conformations in proteins. Structure 24:116–126. https://doi.org/10.1016/j.str.2015.10.025
Shendure J, Ji H (2017) EPSILON-CP: using deep learning to combine information from multiple sources for protein contact prediction. BMC Bioinf 18:303. https://doi.org/10.1186/s12859-017-1713-x
Shindyalov IN, Kolchanov NA, Sander C (1994) Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? Protein Eng 7:349–358
Skerker JM, Perchuk BS, Siryaporn A, Lubin EA, Ashenberg O, Goulian M, Laub MT (2008) Rewiring the specificity of two-component signal transduction systems. Cell 133:1043–1054
Skwark MJ, Abdel-Rehim A, Elofsson A (2013) PconsC: combination of direct information methods and alignments improves contact prediction. Bioinformatics 29:1815–1816
Skwark MJ, Raimondi D, Michel M, Elofsson A (2014) Improved contact predictions using the recognition of protein like contact patterns. PLoS Comput Biol 10:e1003889. https://doi.org/10.1371/journal.pcbi.1003889
Skwark MJ, Michel M, Hurtado DM, Ekeberg M, Elofsson A (2016) Accurate contact predictions for thousands of protein families using PconsC3. bioRxiv. https://doi.org/10.1101/079673
Sułkowska JI, Morcos F, Weigt M, Hwa T, Onuchic JN (2012) Genomics-aided structure prediction. Proc Natl Acad Sci USA 109:10340–10345. https://doi.org/10.1073/pnas.1207864109
Sutto L, Marsili S, Valencia A, Gervasio FL (2015) From residue coevolution to protein conformational ensembles and functional dynamics. Proc Natl Acad Sci USA 112:13567–13572. https://doi.org/10.1073/pnas.1508584112
Talavera D, Lovell SC, Whelan S (2015) Covariation is a poor measure of molecular coevolution. Mol Biol Evol 32:2456–2468. https://doi.org/10.1093/molbev/msv109
Taylor WR, Sadowski MI (2011) Structural constraints on the covariance matrix derived from multiple aligned protein sequences. PLoS ONE 6(12):e28265. https://doi.org/10.1371/journal.pone.0028265
Tokuriki N, Tawfik DS (2009) Protein dynamism and evolvability. Science 324:203–207
Toth-Petroczy A, Palmedo P, Ingraham J, Hopf TA, Berger B, Sander C, Marks DS (2016) Structured states of disordered proteins from genomic sequences. Cell 167:158–170. https://doi.org/10.1016/j.cell.2016.09.010
Tufféry P, Darlu P (2000) Exploring a phylogenetic approach for the detection of correlated substitutions in proteins. Mol Biol Evol 17:1753–1759
Wang S, Sun S, Li Z, Zhang R, Xu J (2017) Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput Biol 13:e1005324. https://doi.org/10.1371/journal.pcbi.1005324
Weigt M, White RA, Szurmant H, Hoch JA, Hwa T (2009) Identification of direct residue contacts in protein-protein interaction by message passing. Proc Natl Acad Sci USA 106:67–72. https://doi.org/10.1073/pnas.0805923106
Weinreb C, Riesselman AJ, Ingraham JB, Gross T, Sander C, Marks DS (2016) 3D RNA and functional interactions from evolutionary couplings. Cell 165:1–13. https://doi.org/10.1016/j.cell.2016.03.030
Wuyun Q, Zheng W, Peng Z, Yang J (2016) A large-scale comparative assessment of methods for residue-residue contact prediction. Brief Bioinform 19:219–230. https://doi.org/10.1093/bib/bbw106
Yanofsky C, Horn V, Thorpe D (1964) Protein structure relationships revealed by mutation analysis. Science 146:1593–1594
Appendix
The full appendix can be found in the article (Miyazawa 2017a) submitted to arXiv.
1.1 Inverse Potts Model
1.1.1 A Gauge Employed for \(h_{i}(a_{k})\) and \(J_{ij}(a_{k},a_{l})\)
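Unless otherwise specified, the following gauge, which we call the q-gauge here, is employed.
$$h_{i}(a_{q})=J_{ij}(a_{k},a_{q})=J_{ij}(a_{q},a_{l})=0$$
In this gauge, the amino acid \(a_{q}\) is the reference state for fields and couplings, and \(P_{i}(a_{q})\), \(P_{ij}(a_{k},a_{q})=P_{ji}(a_{q},a_{k})\), and \(P_{ij}(a_{q},a_{q})\) are regarded as dependent variables. A common choice for the reference state \(a_{q}\) is the most common (consensus) state at each site. Any gauge can be transformed to another by the following transformation.
$$\displaystyle \begin{aligned} J^{\prime}_{ij}(a_{k},a_{l})&\equiv J_{ij}(a_{k},a_{l})-J_{ij}(\cdot ,a_{l})-J_{ij}(a_{k},\cdot )+J_{ij}(\cdot ,\cdot )\\ h^{\prime}_{i}(a_{k})&\equiv h_{i}(a_{k})-h_{i}(\cdot )+\sum _{j\neq i}\left(J_{ij}(a_{k},\cdot )-J_{ij}(\cdot ,\cdot )\right) \end{aligned} $$
where “⋅” denotes the reference state, which may be \(a_{q}\) for each site (q-gauge) or the average over all states (Ising gauge).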
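A gauge change of this kind can be checked numerically. The following Python sketch is illustrative only: the array layout (`h[i, a]`, `J[i, j, a, b]` with `J[j, i] = J[i, j].T` and `J[i, i] = 0`) and the function names are assumptions, not part of the original formalism. It subtracts reference-state terms from the couplings and absorbs them into the fields, so that the Hamiltonian changes only by a sequence-independent constant.

```python
import numpy as np

def random_model(L, q, seed=0):
    """Hypothetical helper: random fields and symmetric couplings."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(L, q))
    J = np.zeros((L, L, q, q))
    for i in range(L):
        for j in range(i + 1, L):
            J[i, j] = rng.normal(size=(q, q))
            J[j, i] = J[i, j].T          # J_ji(b, a) = J_ij(a, b)
    return h, J

def to_gauge(h, J, ref="ising"):
    """ref="ising": '.' is the average over states; ref=int: '.' is that state (q-gauge)."""
    if ref == "ising":
        h_dot = h.mean(axis=1)           # h_i(.)
        J_a_dot = J.mean(axis=3)         # J_ij(a_k, .)
        J_dot_b = J.mean(axis=2)         # J_ij(., a_l)
        J_dot_dot = J.mean(axis=(2, 3))  # J_ij(., .)
    else:
        h_dot = h[:, ref]
        J_a_dot = J[:, :, :, ref]
        J_dot_b = J[:, :, ref, :]
        J_dot_dot = J[:, :, ref, ref]
    J_new = (J - J_dot_b[:, :, None, :] - J_a_dot[:, :, :, None]
             + J_dot_dot[:, :, None, None])
    # Terms removed from the couplings are absorbed into the fields.
    h_new = h - h_dot[:, None] + (J_a_dot - J_dot_dot[:, :, None]).sum(axis=1)
    return h_new, J_new

def neg_energy(h, J, seq):
    """-H(sigma) = sum_i h_i(sigma_i) + sum_{i<j} J_ij(sigma_i, sigma_j)."""
    L = len(seq)
    e = sum(h[i, seq[i]] for i in range(L))
    e += sum(J[i, j, seq[i], seq[j]] for i in range(L) for j in range(i + 1, L))
    return e
```

Because only energy differences between sequences enter the Boltzmann distribution, comparing `neg_energy` differences before and after `to_gauge` is a convenient sanity check.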
1.1.2 Boltzmann Machine
Fields \(h_{i}(a_{k})\) and couplings \(J_{ij}(a_{k},a_{l})\) are estimated by iterating the following two-step procedure.
1. For a given set of \(h_{i}(a_{k})\) and \(J_{ij}(a_{k},a_{l})\), the marginal probabilities \(P^{\mathrm{MC}}(\sigma_{i}=a_{k})\) and \(P^{\mathrm{MC}}(\sigma_{i}=a_{k},\sigma_{j}=a_{l})\) are estimated by a Markov chain Monte Carlo method (the Metropolis-Hastings algorithm (Metropolis et al. 1953)) or by any other method (for example, the message passing algorithm (Weigt et al. 2009)).
2. Then, \(h_{i}(a_{k})\) and \(J_{ij}(a_{k},a_{l})\) are updated according to the gradient of the negative log-posterior-probability per instance, \(\partial S_{0}/\partial h_{i}(a_{k})\) or \(\partial S_{0}/\partial J_{ij}(a_{k},a_{l})\), multiplied by a parameter-specific weight factor (Barton et al. 2016), \(w_{i}(a_{k})\) or \(w_{ij}(a_{k},a_{l})\); see Eqs. 9.8 and 9.12.
$$\displaystyle \begin{aligned} \varDelta h_{i}(a_{k})&=-\left(P^{\mathrm{MC}}(\sigma_{i}=a_{k})+\frac{\partial R}{\partial h_{i}(a_{k})}-P_{i}(a_{k})\right)\cdot w_{i}(a_{k}) \end{aligned} \tag{9.19}$$
$$\displaystyle \begin{aligned} \varDelta J_{ij}(a_{k},a_{l})&=-\left(P^{\mathrm{MC}}(\sigma_{i}=a_{k},\sigma_{j}=a_{l})+\frac{\partial R}{\partial J_{ij}(a_{k},a_{l})}-P_{ij}(a_{k},a_{l})\right)\cdot w_{ij}(a_{k},a_{l}) \end{aligned} \tag{9.20}$$
where the weights are also updated as \(w_{i}(a_{k}) \leftarrow f(w_{i}(a_{k}))\) and \(w_{ij}(a_{k},a_{l}) \leftarrow f(w_{ij}(a_{k},a_{l}))\) according to the RPROP algorithm (Riedmiller and Braun 1993); the function \(f(w)\) is defined as
$$\displaystyle \begin{aligned} f(w)\equiv\left\{\begin{array}{ll} \max(w\cdot s_{-},w_{\min}) & \mathrm{if}\ \mathrm{the}\ \mathrm{gradient}\ \mathrm{changes}\ \mathrm{its}\ \mathrm{sign},\\ \min(w\cdot s_{+},w_{\max}) & \mathrm{otherwise} \end{array}\right. \end{aligned} \tag{9.21}$$
Here, \(w_{\min }=10^{-3}\), \(w_{\max }=10\), \(s_{-}=0.5\), and \(s_{+}=1.9<1/s_{-}\) were employed (Barton et al. 2016). After being updated, \(h_{i}(a_{k})\) and \(J_{ij}(a_{k},a_{l})\) may be modified to satisfy a given gauge.
An advantage of the Boltzmann machine is that the model correlations are calculated as part of the procedure.
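The two-step iteration with RPROP-adapted weights can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the implementation of Barton et al. (2016): the model marginals are obtained by exact enumeration of all \(q^{L}\) sequences instead of MCMC (feasible only for tiny models), the regularizer is assumed to be \(R=(\gamma /2)(\sum h^{2}+\sum J^{2})\), the initial weight 0.1 is arbitrary, and all function names are hypothetical.

```python
import itertools
import numpy as np

def marginals(h, J):
    """Exact one- and two-site marginals (stand-in for the MCMC estimate P^MC)."""
    L, q = h.shape
    idx = np.arange(L)
    states = np.array(list(itertools.product(range(q), repeat=L)))
    logw = np.array([
        h[idx, s].sum()
        + sum(J[i, j, s[i], s[j]] for i in range(L) for j in range(i + 1, L))
        for s in states
    ])
    p = np.exp(logw - logw.max())
    p /= p.sum()
    P1 = np.zeros((L, q))
    P2 = np.zeros((L, L, q, q))
    for s, ps in zip(states, p):
        P1[idx, s] += ps
        P2[idx[:, None], idx[None, :], s[:, None], s[None, :]] += ps
    return P1, P2

def rprop_weights(w, grad, prev_grad, s_minus=0.5, s_plus=1.9,
                  w_min=1e-3, w_max=10.0):
    """f(w): shrink the weight if the gradient changed sign, otherwise grow it."""
    flipped = grad * prev_grad < 0
    return np.where(flipped, np.maximum(w * s_minus, w_min),
                    np.minimum(w * s_plus, w_max))

def train(P1_target, P2_target, n_iter=500, gamma=0.0):
    L, q = P1_target.shape
    h = np.zeros((L, q))
    J = np.zeros((L, L, q, q))
    wh, wJ = np.full(h.shape, 0.1), np.full(J.shape, 0.1)  # assumed initial weights
    gh_prev, gJ_prev = np.zeros_like(h), np.zeros_like(J)
    off_diag = (~np.eye(L, dtype=bool))[:, :, None, None]  # no self-couplings
    for _ in range(n_iter):
        P1, P2 = marginals(h, J)                            # step 1
        gh = P1 + gamma * h - P1_target                     # dS0/dh
        gJ = (P2 + gamma * J - P2_target) * off_diag        # dS0/dJ
        wh = rprop_weights(wh, gh, gh_prev)
        wJ = rprop_weights(wJ, gJ, gJ_prev)
        h -= gh * wh                                        # step 2
        J -= gJ * wJ
        gh_prev, gJ_prev = gh, gJ
    return h, J
```

With `gamma=0`, the trained model should approximately reproduce the target one- and two-site marginals; in a realistic setting `marginals` would be replaced by a sampler.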
1.1.3 Gaussian Approximation for P(σ) with a Normal-Inverse-Wishart Prior
The normal-inverse-Wishart distribution (NIW) is the product of the multivariate normal distribution \((\mathcal {N})\) and the inverse-Wishart distribution \((\mathcal {W}^{-1})\), which are the conjugate priors for the mean vector and for the covariance matrix of a multivariate Gaussian distribution, respectively. The NIW is employed as a prior in GaussDCA (Baldassi et al. 2014), in which the sequence distribution \(P(\sigma )\) is approximated as a Gaussian distribution. In this approximation, the q-gauge is used, and \(P_{i}(a_{q})\), \(P_{ij}(a_{k},a_{q})=P_{ji}(a_{q},a_{k})\), and \(P_{ij}(a_{q},a_{q})\) are regarded as dependent variables; see section “A Gauge Employed for \(h_{i}(a_{k})\) and \(J_{ij}(a_{k},a_{l})\)”. In GaussDCA, deletion is excluded from the independent variables.
The posterior distribution for the NIW is also an NIW. Thus, the cross entropy \(S_{0}\) can be represented as
where \(\varGamma _{\dim \varSigma }(\nu /2)\) is the multivariate Γ function, μ is the mean vector, and \(\dim \varSigma \) is the dimension of covariance matrix Σ, \(\dim \varSigma =(q-1)L\) excluding deletion in GaussDCA. The normal and NIW distributions are defined as follows.
Parameters \(\mu ^{B}\), \(\kappa ^{B}\), \(\nu ^{B}\), and \(\varLambda ^{B}\) satisfy
where \(\varLambda \) and \(\nu \) are the scale matrix and the degrees of freedom, respectively, which shape the inverse-Wishart distribution, and C is the given covariance matrix; \(C_{ij}(a_{k},a_{l})\equiv P_{ij}(a_{k},a_{l})-P_{i}(a_{k})P_{j}(a_{l})\). The mean values of \(\mu \) and \(\varSigma \) under the NIW posterior are \(\mu ^{B}\) and \(\varLambda ^{B}/(\nu ^{B}-\dim \varSigma -1)\), and their mode values are \(\mu ^{B}\) and \(\varLambda ^{B}/(\nu ^{B}+\dim \varSigma +1)\); the mode values minimize the cross entropy, or equivalently maximize the posterior probability. Whichever of the posterior mean or the posterior mode is employed for the estimation of \(\varSigma \), exactly the same estimate can be obtained by adjusting the value of \(\nu \). In GaussDCA, the posterior mean estimate was employed, but here the maximum posterior estimate is employed in accordance with the present formalism.
According to GaussDCA, \(\nu \) is chosen in such a way that \(\varSigma _{ij}(a_{k},a_{l})\) is nearly equal to the covariance matrix corrected by pseudocount; \(\nu =\kappa +\dim \varSigma +1\) for the mean posterior estimate in GaussDCA, but \(\nu =\kappa -\dim \varSigma -1\) for the maximum posterior estimate here.
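The statement that the posterior mean and the posterior mode give the same covariance estimate under these two choices of \(\nu \) can be verified with a few lines of Python. The sketch below assumes the standard textbook form of the NIW conjugate update (\(\kappa ^{B}=\kappa +B\), \(\nu ^{B}=\nu +B\), and so on); the function and variable names are hypothetical, and `C` is taken to be the sample covariance normalized by the number of instances B.

```python
import numpy as np

def niw_posterior(mu0, kappa, nu, Lam, xbar, C, B):
    """Textbook NIW conjugate update after B observations (assumed form).

    xbar: sample mean, C: sample covariance normalized by B.
    """
    kappa_B = kappa + B
    nu_B = nu + B
    mu_B = (kappa * mu0 + B * xbar) / kappa_B
    d = (xbar - mu0)[:, None]
    Lam_B = Lam + B * C + (kappa * B / kappa_B) * (d @ d.T)
    return mu_B, kappa_B, nu_B, Lam_B

def sigma_posterior_mean(Lam_B, nu_B):
    dim = Lam_B.shape[0]
    return Lam_B / (nu_B - dim - 1)

def sigma_posterior_mode(Lam_B, nu_B):
    dim = Lam_B.shape[0]
    return Lam_B / (nu_B + dim + 1)
```

With \(\nu =\kappa +\dim \varSigma +1\) for the mean and \(\nu =\kappa -\dim \varSigma -1\) for the mode, both denominators reduce to \(\kappa +B\), which is why the two estimates coincide.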
From Eq. 9.15, the estimates of couplings and fields are calculated.
Because the number of instances is far greater than 1 (\(B\gg 1\)), these estimates of couplings are practically equal to the estimates (\(J^{\mathrm {MF}}=-\varSigma ^{-1}\)) in the mean field approximation, which was employed in GaussDCA (Baldassi et al. 2014).
The difference \(h_{i}^{\mathrm {NIW}}(a_{k})-h_{i}^{\mathrm {NIW}}(a_{q})\) does not converge to \(\log (P_{i}(a_{k})/P_{i}(a_{q}))\) as \(J^{\mathrm {NIW}}\rightarrow 0\), but \(h_{i}^{\mathrm {MF}}(a_{k})- h_{i}^{\mathrm {MF}}(a_{q})\) does; in other words, the mean field approximation gives a better h for the limiting case of no couplings than the present approximation. Barton et al. (2016) reported that the Gaussian approximation generally gave a better generative model than the mean field approximation.
In GaussDCA (Baldassi et al. 2014), \(\mu ^{0}\) and \(\varLambda /\kappa \) were chosen to be as uninformative as possible, i.e., the mean and covariance for a uniform distribution.
1.1.4 Pseudo-likelihood Approximation
Symmetric Pseudo-likelihood Maximization
The probability of an instance \(\sigma ^{\tau }\) is approximated as follows by the product of conditional probabilities of observing \(\sigma _{i}^{\tau }\) under the given observations \(\sigma _{j\neq i}^{\tau }\) of all other sites.
Then, the cross entropy is approximated as
where the conditional log-likelihoods and the \(\ell _{2}\) norm regularization terms employed in Ekeberg et al. (2013) are
The optimum fields and couplings in this approximation are estimated by minimizing the pseudo-cross-entropy, \(S_{0}^{\mathrm {PLM}}\).
Equation 9.38 is not invariant under gauge transformation; the \(\ell _{2}\) norm regularization terms in Eq. 9.38 favor only a specific gauge that corresponds to \(\gamma _{J}\sum _{l}J_{ij}(a_{k},a_{l})=\gamma _{h}h_{i}(a_{k})\), \(\gamma _{J}\sum _{k}J_{ij}(a_{k},a_{l})=\gamma _{h}h_{j}(a_{l})\), and \(\sum _{k}h_{i}(a_{k})=0\) for all i, j (> i), k, and l (Ekeberg et al. 2013). In Ekeberg et al. (2013), \(\gamma _{J}=\gamma _{h}=0.01\), a relatively large value independent of B, was employed. In Hopf et al. (2017), \(\gamma _{h}=0.01\) but \(\gamma _{J}=q(L-1)\gamma _{h}\) were employed; there, gapped sites in each sequence were excluded from the calculation of the Hamiltonian \(H(\sigma )\), and therefore q = 20.
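To make the pseudo-likelihood objective concrete, the following Python sketch evaluates a regularized pseudo-cross-entropy for given fields and couplings. It is a minimal illustration under assumptions: the regularizer is taken as \(\gamma _{h}\sum h^{2}+\gamma _{J}\sum _{i<j}J^{2}\), sequences are integer-encoded, sequence weights default to uniform, and the function names and array layout (`h[i, a]`, `J[i, j, a, b]` with `J[j, i] = J[i, j].T`) are hypothetical rather than those of plmDCA.

```python
import numpy as np

def site_log_conditional(h, J, seq, i):
    """log P(sigma_i = seq[i] | all other sites of seq)."""
    L, q = h.shape
    logits = h[i].copy()
    for j in range(L):
        if j != i:
            logits += J[i, j, :, seq[j]]
    m = logits.max()
    log_norm = m + np.log(np.exp(logits - m).sum())  # stable log-sum-exp
    return logits[seq[i]] - log_norm

def pseudo_cross_entropy(h, J, seqs, weights=None, gamma_h=0.01, gamma_J=0.01):
    """Weighted negative pseudo-log-likelihood per instance plus l2 terms."""
    B = len(seqs)
    w = np.full(B, 1.0 / B) if weights is None else np.asarray(weights) / np.sum(weights)
    L = h.shape[0]
    nll = -sum(w_t * site_log_conditional(h, J, s, i)
               for w_t, s in zip(w, seqs) for i in range(L))
    iu = np.triu_indices(L, k=1)                     # pairs i < j only
    reg = gamma_h * (h ** 2).sum() + gamma_J * (J[iu] ** 2).sum()
    return nll + reg
```

Minimizing this function with respect to `h` and `J`, for example with a quasi-Newton optimizer, would give the symmetric pseudo-likelihood estimates; for zero parameters the value is \(L\log q\), since every conditional probability equals \(1/q\).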
GREMLIN (Kamisetty et al. 2013) employs Gaussian prior probabilities that depend on site pairs.
where \(P_{ij}^{0}\) is the prior probability of site pair (i, j) being in contact.
Asymmetric Pseudo-likelihood Maximization
To speed up the minimization of \(S_{0}\), a further approximation, in which each \(S_{0,i}\) is minimized separately, is employed (Ekeberg et al. 2014), and fields and couplings are estimated as follows.
It is appropriate to transform the h and J estimated above into some specific gauge such as the Ising gauge.
1.1.5 ACE (Adaptive Cluster Expansion) of Cross-Entropy for Sparse Markov Random Field
The cross entropy \(S_{0}(\{h_{i},J_{ij}\}|\{P_{i}\},\{P_{ij}\},i,j\in \varGamma )\) of a cluster of sites \(\varGamma \), which is defined as the negative log-likelihood per instance in Eq. 9.14, is approximately minimized by taking account of sets \(L_{k}(t)\) of only significant clusters consisting of k sites, the incremental entropy (cluster cross entropy) \(\varDelta S_{\varGamma }\) of which is significant (\(|\varDelta S_{\varGamma }|>t\)) (Cocco and Monasson 2011, 2012; Barton et al. 2016).
\(L_{k+1}(t)\) is constructed from \(L_{k}(t)\) by adding each cluster \(\varGamma \) of (k + 1) sites that satisfies, in the lax case, \(\varGamma _{1}\cup \varGamma _{2}=\varGamma \) for some pair of size-k clusters \(\varGamma _{1},\varGamma _{2}\in L_{k}(t)\) or, in the strict case, \(\varGamma ^{\prime }\in L_{k}(t)\) for every \(\varGamma ^{\prime }\subset \varGamma \) with \(|\varGamma ^{\prime }|=k\). Thus, Eq. 9.43 yields sparse solutions. The cross entropies \(S_{0}(\{P_{i},P_{ij}|i,j\in \varGamma ^{\prime }\})\) for small clusters are estimated by minimizing \(S_{0}(\{h_{i},J_{ij}\}|\{P_{i},P_{ij}\},i,j\in \varGamma ^{\prime })\) with respect to fields and couplings. Starting from a large value of the threshold t (typically t = 1), the cross entropy \(S_{0}(\{P_{i},P_{ij}\}|i,j\in \{1,\ldots ,N\})\) is calculated by gradually decreasing t until its value converges. Convergence of the algorithm may be more difficult for alignments of long proteins or those with very strong interactions. In such cases, strong regularization may be employed.
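The lax and strict construction rules described above can be illustrated with a short Python sketch. It covers only the generation of candidate clusters of size k + 1 from \(L_{k}(t)\); the subsequent significance test \(|\varDelta S_{\varGamma }|>t\) is omitted, and the function name and the representation of clusters as frozensets of site indices are assumptions made for illustration.

```python
from itertools import combinations

def next_clusters(Lk, strict=False):
    """Candidate clusters of size k+1 built from a collection of size-k clusters.

    Lax rule: keep any union of two clusters in Lk that has exactly k+1 sites.
    Strict rule: additionally require every size-k subset to be in Lk.
    """
    Lk = {frozenset(c) for c in Lk}
    candidates = set()
    for c1, c2 in combinations(Lk, 2):
        union = c1 | c2
        if len(union) == len(c1) + 1:
            candidates.add(union)
    if strict:
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in Lk for sub in combinations(c, len(c) - 1))
        }
    return candidates
```

For example, from the pairs {1, 2}, {2, 3}, and {3, 4}, the lax rule proposes {1, 2, 3} and {2, 3, 4}, whereas the strict rule proposes nothing because {1, 3} and {2, 4} are absent.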
The following \(\ell _{2}\) norm regularization terms are employed in ACE (Barton et al. 2016), and so Eq. 9.43 is not invariant under gauge transformation.
\(\gamma _{h}=\gamma _{J}\propto 1/B\) was employed (Barton et al. 2016).
The compression of the number of Potts states, \(q_{i}\leq q\), at each site can be taken into account. All infrequently observed states, or states that contribute insignificantly to the site entropy, can be treated as the same state, and a complete model can be recovered (Barton et al. 2016) by setting \(h_{i}(a_{k})= h_{i}^{\prime }(a_{k^{\prime }})+\log (P_{i}(a_{k})/P_{i}^{\prime }(a_{k^{\prime }}))\) and \(J_{ij}(a_{k},a_{l})=J_{ij}^{\prime }(a_{k^{\prime }},a_{l^{\prime }})\), where “′” denotes the corresponding aggregated state and potential.
Starting from the output set of the fields \(h_{i}(a_{k})\) and couplings \(J_{ij}(a_{k},a_{l})\) obtained from the cluster expansion of the cross entropy, a Boltzmann machine is trained with \(P_{i}(a_{k})\) and \(P_{ij}(a_{k},a_{l})\) by the RPROP algorithm (Riedmiller and Braun 1993) to refine the parameter values of \(h_{i}(a_{k})\) and \(J_{ij}(a_{k},a_{l})\) (Barton et al. 2016); see section “Boltzmann Machine”. This post-processing is also useful because model correlations are calculated.
An appropriate value of the regularization parameter for trypsin inhibitor was much larger (\(\gamma =1\)) for contact prediction than that (\(\gamma =2/B=10^{-3}\)) for recovering true fields and couplings (Barton et al. 2016), probably because the task of contact prediction requires the relative ranking of interactions rather than their actual values.
1.1.6 Scoring Methods for Contact Prediction
Corrected Frobenius Norm (\(L_{2,2}\) Matrix Norm), \(\mathcal {S}_{ij}^{\mathrm {CFN}}\)
For scoring, plmDCA (Ekeberg et al. 2013, 2014) employs the corrected Frobenius norm of \(J_{ij}^{\mathrm {I}}\) transformed in the Ising gauge, in which \(J_{ij}^{\mathrm {I}}\) does not contain anything that could have been explained by fields h i and h j; \(J_{ij}^{\mathrm {I}}(a_{k},a_{l})\equiv J_{ij}(a_{k},a_{l})-J_{ij}(\cdot ,a_{l})-J_{ij}(a_{k}, \cdot )+J_{ij}(\cdot , \cdot )\) where \(J_{ij}( \cdot ,a_{l})=J_{ji}(a_{l}, \cdot )\equiv \sum _{k=1}^{q}J_{ij}(a_{k},a_{l})/q\).
where “⋅” denotes the average over the indicated variable. This CFN score with the gap state excluded in Eq. 9.47 performs better (Ekeberg et al. 2014; Baldassi et al. 2014) than both the FN and DI/EC scores (Weigt et al. 2009; Morcos et al. 2011; Marks et al. 2011; Hopf et al. 2012).
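A corrected Frobenius norm score of this kind can be sketched in Python as follows. This is an illustration under assumptions rather than the plmDCA code: couplings are stored as `J[i, j, a, b]`, the correction is taken to be the usual average product correction \(\mathcal {S}_{ij}-\mathcal {S}_{i\cdot }\mathcal {S}_{\cdot j}/\mathcal {S}_{\cdot \cdot }\) with averages over \(i\neq j\), and the function name and the optional `gap` index are hypothetical.

```python
import numpy as np

def cfn_scores(J, gap=None):
    """Corrected Frobenius norm of couplings transformed to the Ising gauge.

    J: array of shape (L, L, q, q). gap: index of the gap state to exclude, if any.
    """
    L, _, q, _ = J.shape
    # Ising gauge: remove row, column, and overall averages of each q x q block.
    JI = (J - J.mean(axis=2, keepdims=True) - J.mean(axis=3, keepdims=True)
          + J.mean(axis=(2, 3), keepdims=True))
    if gap is not None:
        keep = [a for a in range(q) if a != gap]
        JI = JI[:, :, keep][:, :, :, keep]
    fn = np.sqrt((JI ** 2).sum(axis=(2, 3)))          # Frobenius norm per site pair
    np.fill_diagonal(fn, 0.0)
    row = fn.sum(axis=1, keepdims=True) / (L - 1)     # S_{i.}
    col = fn.sum(axis=0, keepdims=True) / (L - 1)     # S_{.j}
    tot = fn.sum() / (L * (L - 1))                    # S_{..}
    cfn = fn - row * col / tot
    np.fill_diagonal(cfn, 0.0)
    return cfn
```

Site pairs would then be ranked by `cfn_scores(J)[i, j]`, with the highest-scoring pairs predicted to be in contact. Because the Ising-gauge transformation is applied first, adding a term that depends on only one of the two states to `J` leaves the scores unchanged.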
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Miyazawa, S. (2018). Prediction of Structures and Interactions from Genome Information. In: Nakamura, H., Kleywegt, G., Burley, S., Markley, J. (eds) Integrative Structural Biology with Hybrid Methods. Advances in Experimental Medicine and Biology, vol 1105. Springer, Singapore. https://doi.org/10.1007/978-981-13-2200-6_9
DOI: https://doi.org/10.1007/978-981-13-2200-6_9
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-2199-3
Online ISBN: 978-981-13-2200-6
eBook Packages: Biomedical and Life Sciences (R0)