Embeddings from protein language models predict conservation and variant effects

The emergence of SARS-CoV-2 variants increased the demand for tools to interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) experiments continue to expand our understanding of the mutational landscape of single proteins, their results remain challenging to analyze. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq did from MSAs (two-state Matthews Correlation Coefficient, MCC, of 0.596 ± 0.006 for ProtT5 embeddings vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask-reconstruction probabilities into a simple logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach to be competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, neither consistently nor statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction, at a fraction of the costs in computing/energy.
Our method predicted SAV effects for the entire human proteome (~20 k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein. Supplementary Information: The online version contains supplementary material available at 10.1007/s00439-021-02411-y.


Short description of Supporting Online Material
In this Supporting Online Material (SOM), we provide an extended analysis of our 9-class (Fig. S3, Table S1) and 2-class (Fig. S2, Table S1) residue conservation prediction performance and further substantiate the choice of our threshold for binary conservation prediction (Fig. S1). Furthermore, we evaluate ProtBert's (Elnaggar et al. 2021) amino acid reconstruction performance to facilitate our understanding of the model's mistakes during token reconstruction (Fig. S4, Fig. S5). The distribution of those probabilities is also added (Fig. S6). Additionally, we provide a more detailed overview of binary SAV effect prediction performance on data from PMD4k (Kawabata et al. 1999) (Table S2) and from four DMS experiments (DMS4) of human proteins (Findlay et al. 2018; Majithia et al. 2016; Matreyek et al. 2018) when binarized for different thresholds (Table S3, Table S4).

Fig. S1: Choice of threshold for binary conservation prediction
Embeddings from ProtT5 (Elnaggar et al. 2021) were used as input to a convolutional neural network (CNN; method dubbed ProtT5cons) to predict 9-class conservation scores. For binarizing the 9 classes into "conserved" and "non-conserved", different thresholds were used. The thresholds range from "Classes >1 cons." with non-conserved being "1" and conserved being "2-9" up to "Classes >8 cons." with non-conserved being "1-8" and conserved being "9". Panel A shows F1+ (Eqn. 6) for "conserved" predictions of ProtT5cons in blue and F1− (Eqn. 7) for "non-conserved" predictions in red at different thresholds. Panel B shows MCC (Eqn. 8) scores at different thresholds. The threshold of "5" (non-conserved "1-5" and conserved "6-9") had the highest MCC value and the best F1 scores when maximizing for both positive and negative classes. For detailed dataset and method descriptions see Datasets and Methods. Black bars mark the 95% confidence interval (Eqn. 12).
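The threshold sweep for binarizing the 9 conservation classes can be sketched in a few lines. This is an illustrative reimplementation, not the published VESPA code; `mcc` follows the standard two-state Matthews Correlation Coefficient definition (Eqn. 8), and a class is labeled "conserved" when it exceeds the threshold.

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Two-state Matthews Correlation Coefficient (Eqn. 8)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def binarize(classes, threshold):
    """Map 9-class conservation to binary: class > threshold -> conserved (1)."""
    return [1 if c > threshold else 0 for c in classes]

def best_threshold(true9, pred9):
    """Sweep thresholds 1..8 ("Classes >t cons.") and return the MCC-optimal one."""
    scores = {}
    for t in range(1, 9):
        yt, yp = binarize(true9, t), binarize(pred9, t)
        tp = sum(1 for a, b in zip(yt, yp) if a == 1 and b == 1)
        tn = sum(1 for a, b in zip(yt, yp) if a == 0 and b == 0)
        fp = sum(1 for a, b in zip(yt, yp) if a == 0 and b == 1)
        fn = sum(1 for a, b in zip(yt, yp) if a == 1 and b == 0)
        scores[t] = mcc(tp, tn, fp, fn)
    return max(scores, key=scores.get), scores
```

On the ConSurf10k hold-out set this sweep selects threshold 5 (non-conserved "1-5", conserved "6-9"), the split used throughout the SOM.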

Fig. S2: Confusion matrix for 2-class conservation prediction on ConSurf
Evaluation of residue conservation prediction on the final ConSurf10k hold-out test set (519 protein sequences, 49.2% conserved, 50.8% non-conserved residues). 9-class predictions were binarized by mapping the classes to non-conserved (1-5) and conserved (6-9). Embeddings from ProtT5 were used as input to a convolutional neural network (CNN; method dubbed ProtT5cons). For detailed dataset and method descriptions see Datasets and Methods.
Fig. S3: Confusion matrix for 9-class conservation prediction on ConSurf
Evaluation of 9-class residue conservation prediction on the final ConSurf10k hold-out test set (519 protein sequences), with 1 being highly non-conserved and 9 being highly conserved. Embeddings from ProtT5 were used as input to a convolutional neural network (CNN; method dubbed ProtT5cons). Most wrong predictions fall into neighboring conservation bins, as highlighted by the shading along the diagonal. The network did not learn to distinguish between the most variable classes 1 and 2; instead, it learnt to predict the majority class (class 1). For detailed dataset and method descriptions see Datasets and Methods.

Fig. S4: Confusion matrix for amino acid reconstruction on ConSurf
Evaluation of ProtBert's amino acid reconstruction performance on the final ConSurf10k hold-out test set (519 protein sequences). During pre-training, ProtBert was optimized on reconstructing corrupted input tokens from non-corrupted sequence context (masked language modeling). Here, we corrupted and reconstructed all residues in all proteins of the ConSurf10k hold-out test set, one residue at a time. The confusion matrix shows ProtBert's ability to reconstruct the correct amino acid when picking the amino acid reconstructed with the highest probability. In total, ProtBert reconstructed 43.5% of the amino acids in this set correctly, compared to a random baseline of 9.4% when always picking the most frequent amino acid.

Fig. S5: Over- and under-prediction of amino acid types from ProtBert's reconstruction probabilities
Here, we illustrate the over- and under-prediction of different amino acid types from ProtBert's mask reconstruction probabilities (Elnaggar et al. 2021). In brief, we used the ConSurf10k hold-out set (519 protein sequences), corrupted one residue at a time, and used ProtBert to reconstruct it from non-corrupted sequence context. We divided the number of predictions per amino acid type (sum over rows in Fig. S4) by the number of true occurrences of this amino acid type in the hold-out set (sum over columns in Fig. S4) and subtracted 1. As a result, values around 0 implied that an amino acid was predicted with the same frequency as it appeared in the test set, while values around 1 implied that observing the prediction of this amino acid was twice as likely as expected from its background frequency in the dataset. Negative values, in turn, implied predicting this amino acid type less often than expected by chance. Amino acids shown on the x-axis are ordered by their background frequency in the data set (left: most frequent; right: least frequent).
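The over-/under-prediction ratio described above is a simple row-sum over column-sum computation on the confusion matrix. A minimal sketch, assuming the matrix is given as a nested list `cm[pred][true]` (an illustrative layout, not the published code):

```python
def prediction_bias(cm):
    """Over-/under-prediction per amino acid type.

    cm[i][j] counts residues predicted as type i whose true type is j
    (rows: predictions, columns: true occurrences, as in Fig. S4).
    Returns (#predicted / #observed) - 1 per type: 0 means the type is
    predicted exactly as often as it occurs, 1 means twice as often,
    and negative values mean under-prediction.
    """
    n = len(cm)
    predicted = [sum(cm[i]) for i in range(n)]                       # row sums
    observed = [sum(cm[i][j] for i in range(n)) for j in range(n)]   # column sums
    return [p / o - 1 if o else float("nan") for p, o in zip(predicted, observed)]
```

For a 20x20 amino acid confusion matrix this yields the 20 bar heights of the bias plot.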

Fig. S6: Distribution of ProtBert's reconstruction probabilities
In its original publication, ProtBert was trained on masked language modeling, i.e., reconstructing corrupted input tokens from non-corrupted sequence context. Here, we corrupted and reconstructed, one residue at a time, all residues of the entire ConSurf10k dataset. As a result, ProtBert returns a probability for each of the 20 amino acids at each residue position. The distribution shows that ProtBert tends to assign very high probabilities to very few amino acids (small peak around 1) while assigning relatively low probabilities to the other 19 amino acids.

Methods compared: a) VESPA: ensemble for SAV effect prediction using residue conservation prediction from ProtT5cons, substitution scores from BLOSUM62, and the log-odds ratio of substitution probabilities from ProtT5 (dubbed ProtT5-logodds); b) VESPAl: fast version of VESPA using only residue conservation prediction from ProtT5cons and substitution scores from BLOSUM62; c) ProtT5-logodds: log-odds ratio of the substitution probabilities for a wild-type amino acid and a specific mutant at the same position (see Methods - Substitution probabilities for more details); d) BLOSUM62: context-independent substitution scores from BLOSUM62. Absolute Spearman correlation (Eqn. 11) is shown for each method and experiment. The rightmost column shows the mean absolute Spearman correlation for each method.

Evaluation of 2-class conservation prediction on the final ConSurf10k hold-out test set (519 protein sequences). 9-class predictions were binarized by mapping the classes to non-conserved (1-5) and conserved (6-9). Embeddings from three protein LMs (ESM-1b, ProtBert, and ProtT5) were used as input to three classifiers (logistic regression (LR), feed-forward neural network (FNN), and convolutional neural network (CNN)). Additionally, we compared to an existing solution that identifies residue conservation using evolutionary information from multiple sequence alignments (ConSeq). We found that the CNN trained on ProtT5 embeddings reaches a performance within the confidence interval of ConSeq. The reported correlation refers to the Pearson correlation coefficient.
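The ProtT5-logodds score above compares the pLM's reconstruction probability of a specific mutant to that of the wild type at the masked position. A minimal sketch, assuming the 20 per-position probabilities are available as a dict; the exact normalization and sign convention used in VESPA may differ:

```python
from math import log

def log_odds_score(probs, wt, mut):
    """Zero-shot SAV score from masked reconstruction probabilities.

    probs: dict mapping each one-letter amino acid code to the pLM's
    reconstruction probability at the masked position; wt/mut: codes
    of the wild-type and mutant residue. More negative scores (mutant
    far less likely than wild type) suggest a stronger effect.
    """
    return log(probs[mut]) - log(probs[wt])
```

A usage example: if ProtBert assigns 0.8 to the wild type and 0.1 to the mutant, the score is log(0.1) - log(0.8) ≈ -2.08, flagging the substitution as unlikely.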
For detailed dataset and method descriptions see Datasets and Methods. ± values mark the 95% confidence interval (Eqn. 12).

Comparison of SAV effect prediction on DMS39 (39 DMS experiments (Riesselman et al. 2018)) with 135,665 SAV scores. Methods: a) DeepSequence trains an unsupervised model, i.e., no effect score labels are required, using only information from MSAs for each of the DMS experiments individually (Riesselman et al. 2018); b) GEMME uses statistical means together with expert domain knowledge to infer evolutionary trees and conserved sites from MSAs to predict mutational effects (Laine et al. 2019); c) ESM-1v: log-odds ratios of substitution probabilities (see Methods for details) are readily correlated with SAV effect magnitudes without further optimization on any SAV effect data (zero-shot learning) (Meier et al. 2021); d) VESPA (this work): logistic regression ensemble trained on binary effect classification (effect/neutral) using predicted residue conservation (ProtT5cons), substitution scores from BLOSUM62 (Henikoff and Henikoff 1992), and log-odds ratios of substitution probabilities from ProtT5 (Elnaggar et al. 2021); e) VESPAl: "light" version of VESPA using only predicted conservation and BLOSUM62 as input. pLM-based methods (ESM-1v, VESPA, VESPAl) always provide predictions, irrespective of MSA depth, while MSA-based methods (DeepSequence, GEMME) rely on MSA input and only provide predictions above a certain MSA depth/diversity.
The predictions of pLM-based methods are compared for the substitutions where MSA-based predictions are available (90% of DMS39) to the substitutions where there are none (10% of DMS39) (see Table S8).
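Comparing predictions on the covered vs. uncovered substitutions boils down to splitting the SAVs by MSA availability and computing Spearman correlation (Eqn. 11) on each subset. A self-contained, pure-Python sketch with tie-aware ranks; function names and the list-based interface are illustrative:

```python
def rank(values):
    """Average ranks, 1-based; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation on the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

def spearman_by_coverage(pred, dms, covered):
    """Correlate predictions with DMS scores separately for SAVs where
    MSA-based predictions exist (covered=True) and where they do not."""
    out = {}
    for flag in (True, False):
        pairs = [(p, d) for p, d, c in zip(pred, dms, covered) if c == flag]
        out[flag] = spearman([p for p, _ in pairs], [d for _, d in pairs])
    return out
```

In practice one would use scipy.stats.spearmanr instead; the explicit version only shows what the numbers in Table S8 measure.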

C. Marquet et al. & B. Rost
The small coverage disparity between pLM-based methods and DMS scores (0.28%) occurred due to minimal deviations of the input sequences derived from the alignments provided by Riesselman et al. (2018). The column "coverage of DMS scores" shows the percentage of total DMS annotations for which predictions of each method were available.
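The "coverage of DMS scores" column can be reproduced by intersecting each method's predicted SAVs with the DMS-annotated SAVs. A toy sketch; the SAV identifiers (e.g. "A25V") and function name are illustrative:

```python
def coverage(dms_savs, predicted_savs):
    """Percentage of DMS-annotated SAVs for which a method returned
    a prediction (the 'coverage of DMS scores' column)."""
    dms = set(dms_savs)
    return 100.0 * len(dms & set(predicted_savs)) / len(dms)
```

A pLM-based method that scores every possible substitution reaches 100% here, while MSA-based methods fall short wherever the alignment was too shallow to predict.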