Introduction

Protein–protein interactions (PPIs) are physical contacts of high specificity established between two or more proteins as a result of biochemical events steered by interactions that include electrostatic forces, hydrogen bonding, and the hydrophobic effect. These physical contacts are molecular associations between chains that occur in a cell or living organism in a specific biomolecular context. PPIs play a key role in biological processes [1, 2]. The knowledge of PPIs allows the deciphering of diseases and the design of new drugs and therapeutic antibodies [3]. A typical PPI involves an interaction between a target protein and one or more binding proteins. For instance, the target protein may be part of a pathogen, such as the spike glycoprotein of SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2), while the binder may be a neutralizing antibody, such as the S309 antibody Fab fragment [4]. The resulting complex is illustrated in Fig. 1, where the antibody (binder) is depicted in green, while the spike (target protein) is shown in pink.

Fig. 1
figure 1

Structure of the SARS-CoV-2 spike glycoprotein (in pink) in complex with the S309 neutralizing antibody Fab fragment (in green)

Proteins are large biomolecules that comprise one or more long chains of amino acid residues. They perform a vast array of functions within organisms, including catalyzing metabolic reactions, replicating DNA, responding to stimuli, providing structure to cells and organisms, and transporting molecules from one location to another. Proteins differ from one another primarily in their amino acid sequence, which is dictated by the nucleotide sequence of the corresponding genes and usually causes protein folding into a specific 3-D structure that determines its activity [5].

The existence of interactions among proteins may be determined by techniques such as proteome chips [3], microarrays [6], and Strep-tag® [7]. Although outstanding progress has been made, these in vitro methods remain both time-consuming and expensive when a large number of proteins are involved [8, 9]. On the other hand, computational or in silico approaches, based on machine learning techniques, allow for the scanning of a large number of proteins in a limited amount of time [10,11,12,13,14,15,16,17,18]. Therefore, given a target protein, it is possible to rapidly screen a large number of binders with a PPI predictor to determine the most suitable candidate for binding. Nevertheless, the screening process is limited to known proteins and may not necessarily be suitable for all applications, e.g., the design of a therapeutic antibody for a new or orphan disease [19]. As a result, it is not possible to directly generate new proteins with a screening method. Therefore, there is a great need for solutions, especially those based on deep learning, that can predict non-existent binders in addition to known ones.

This study addresses this problem. More specifically, it is proposed, given the amino acid sequence of a target protein, to determine the amino acid sequence of the corresponding binder without recourse to domain knowledge such as structural information. In our work, we employ a novel transformer-based deep learning architecture to realize this objective.

As one of the most prominent deep learning algorithms, the transformer has achieved success in fields including natural language processing (NLP) [20] and computer vision (CV) [21]. The transformer was originally proposed as a seq2seq algorithm for machine translation [22, 23], where it outperformed the previous state-of-the-art approaches in the English-to-German and English-to-French translation challenges [23]. According to Lin et al. [24], transformers have the following advantages: (1) a transformer can relate any pair of positions in the sequence in a single step; compared to recurrent neural networks (RNNs), this property provides a better capability for handling long-range dependencies; (2) whereas convolutional neural networks (CNNs) need a multi-layered stack to obtain a global receptive field, transformers achieve it with a single self-attention layer; (3) even though fully connected layers also possess this ability, transformers are more parameter-efficient; (4) transformers require a constant number of sequential operations, whereas for RNNs this number grows with the sequence length, so transformers are more parallelizable. With these motivations, we employ transformers as the backbone of our algorithm.

The reason for focusing our attention on amino acid sequences is that they may be experimentally determined rapidly and cheaply with high-throughput systems [25], in contrast to 3-D structures, which must be determined with slow and expensive methods such as X-ray crystallography [26].

Amino acid sequences may be assimilated to pseudo-time series, i.e., the time corresponding to the position of the amino acid in the chain, because they are oriented: the beginning of a chain is marked by an amino-terminus (N-terminus), while the end is marked with a carboxyl-terminus (C-terminus) [27]. As such, deep learning techniques related to time series and natural language processing (NLP) may be transposed to predict amino acid sequences. Therefore, it is proposed to employ a transformer [23] for predicting the binders, a transformer being a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data to predict the output.

Another problem specific to proteins is that there is usually more than one possible binder for a given target protein [28]. As a result, the problem is multivalued (having more than one solution), and additional constraints must be applied to be able to learn a particular solution (binder). Two constraints are employed in our deep learning solution: the binding score, which is related to the strength of the interaction, and a prior about the amino acid sequence of the binder. The score is either appended to the amino acid sequence of the target protein (input of the encoder) or injected directly into the latent space of the encoder. In order to further constrain the problem, a prior is applied to the output of the decoder, i.e., to the binder amino acid sequence. This prior consists of the first \(\textit{n}\) amino acids of the binder to ensure the uniqueness of the solution.

In this study, we propose a primary sequence-based approach to design PPIs with two different transformers. We also address the multivalued problem in PPI design by introducing the binding score as well as a prior. Given 10% of the binder as a prior, both transformers are able to achieve a Needleman-Wunsch score of 0.86.

Our main contributions include (1) a novel deep-learning model that generates unique binders given a target protein, a binding score, and a prior on the binder, (2) two innovative transformer architectures that allow the binder generation to be conditioned by previous knowledge (conditional transformer), and (3) a new search strategy named Beam Search with Prior (BSP), which considerably reduces the computational complexity and error accumulation.

This paper is organized as follows: related work is covered in the following section. The dataset and the preprocessing techniques are presented next, followed by the new transformer-based deep learning architectures employed for binder prediction; to the best of our knowledge, this is the first work that employs transformers for PPI design. The transformer training procedure is described in “Network training”, and the inference, the evaluation techniques, and the prior are addressed in “Inference, evaluation, and prior”, while the experimental results are reported in the subsequent section. Finally, the last section concludes the paper.

Related work

Protein design is the rational design of new protein molecules in order to obtain novel activity, behavior, or purpose, and to advance the basic understanding of protein function. Proteins may be designed from scratch (de novo design) or by making calculated variants of a known protein structure and its sequence (protein redesign). The most common methods for redesign are clustering of residue interactions followed by searching for a high-affinity binding protein [29, 30], searching for an existing binding protein with high affinity that is subsequently optimized [31,32,33], and incorporating metal-binding sites while optimizing their periphery [34]. As for de novo design, the most common approaches are optimizing existing binding sites [35, 36] and grafting either one [37,38,39] or two sides [40] of an existing site. All these techniques rely on domain knowledge such as the tertiary structure (3-D shape) and information about residue interactions and complexes. For instance, metal-binding sites are restricted to nonpolar proteins [34]. Frequently, information such as the tertiary structure is unavailable, thereby considerably reducing the effectiveness and generalizability of these methods.

There are relatively few studies about protein design based on deep learning. We select a few representative studies based on the criterion that they address sequence-based protein design. Among the most important are signal peptide generation by attention-based neural networks [41], antibody design using a long short-term memory (LSTM)-based deep generative model from a phage display library for affinity maturation [42], sequence-based deep learning antibody design for in silico antibody affinity maturation [43, 44], improving the accuracy of neural network-based computational protein sequence design with DenseNet [45] using backbone structures, computational protein design with deep learning neural networks [46], and the generation of functional protein variants with variational autoencoders [47]. The approach of Wu et al. [41] lacks the ability to solve multivalued problems, and all these methods require either knowledge of the 3-D structure [45] or domain-specific knowledge [43, 44].

An approach for solving the antibody-antigen binding problem is proposed in [43]. Given an antibody-antigen pair, a one-hot embedded matrix and an adjacency matrix are first evaluated. Then, the two matrices are fed to a neighborhood aggregation layer and a graph neural network (GNN) to obtain a binary classification. Being confined to binary classification, the method of Kang et al. is unsuitable for determining the sequences of binding proteins.

A single-domain antibody (sdAb), also known as a nanobody, is an antibody fragment consisting of a single monomeric variable antibody domain. Like a whole antibody, it can bind selectively to a specific antigen. Shin et al. employ an autoregressive dilated convolutional model for predicting nanobody sequences [44]. As the model is autoregressive, the amino acids are predicted sequentially, with the prediction of a new amino acid being based on the previous ones. Wang et al. propose a framework, consisting of three cooperating deep learning models, that predicts the amino acid sequence of the binder based on the backbone structure (3-D) of the target protein [46]. The applicability of this framework is limited to target proteins for which the backbone structure is known.

In one study, a variational autoencoder is employed to generate new proteins [47]. In this work, by modulating the latent representation, it is possible to generate new proteins. Unfortunately, the autoencoder does not account for the target protein, making it unsuitable for interacting proteins.

The approach by [45] is based on DenseNet [48], which is a convolutional neural network with several parallel skip connections that was originally introduced in computer vision for diagnostics and identification [49, 50] and more recently was used to diagnose COVID-19 infections from computed tomography scans [51]. In the proposed architecture, DenseNet is extended to volumetric data [45]. The latter consists of regular volumetric grids in which each volume element is related to the atomic density, while each grid corresponds to a particular atomic type, thus focusing on the 3-D structure. The amino acid sequence of the target residues is predicted from the distribution of backbone atoms. Qi et al. achieved an accuracy of 53.2% with a 5-fold cross-validation, which constitutes an improvement of more than 10% compared to the state of the art. Unfortunately, because of the one-to-one correspondence between each amino acid and the backbone, the target protein and the binder must have the same number of amino acids, which strongly limits the applicability of this approach. Indeed, most of the time, the binder and the target protein do not have the same number of amino acids [52].

This limitation may be circumvented by employing an autoregressive model [53] in which each new amino acid is predicted recursively from the previous ones. The recursive prediction process ends when an end-of-sequence (EOS) token is predicted. Therefore, the model can predict sequences of variable length, thus allowing for binders and targets that do not have the same number of amino acids, which is by far the most common scenario. Such a strategy was followed by Saka et al. [42] for antibody design against kynurenine, a metabolite used in the production of niacin. The autoregressive model consisted of an LSTM [54]. An extended dataset was obtained with the phage display laboratory technique [55] and an antigen-binding (Fab) library [56]. The model of Saka et al. can predict sequences of arbitrary length. Nonetheless, the predictions are based on a specific probability distribution, the one associated with the maturation of antibodies against kynurenine, which means that the approach cannot be easily generalized. A similar problem occurs in [44].

The study by Wu et al. [41] is the one closest to our work. The authors employ an autoregressive model to predict binders of variable length. Their training dataset consists of signal peptides (short amino acid sequences) and their corresponding target proteins, and their architecture is based on a seq2seq transformer. Recall that a transformer is a deep learning model that adopts the mechanism of self-attention [23]. As such, their seq2seq model transforms the target protein amino acid sequence into the binder amino acid sequence. The main components are the encoder and the decoder. The encoder converts each input sequence into a more informative and compact latent (hidden) representation. The decoder reverses this process, converting the latent representation into an output sequence using the previous output as the input context. The network includes two optimization mechanisms: attention and beam search. The input to the decoder is a single vector that stores the entire context. Attention allows the decoder to look at the input sequence selectively. As for beam search, instead of retaining only the single most probable output (amino acid), multiple highly probable choices are retained and tracked in a data structure. For testing, de novo signal peptides were generated with the trained model. It was determined experimentally that 48% of the peptides bound effectively. These peptides shared on average 73% of their sequence with the corresponding closest known signal peptides.

Yet, there are some limitations to the method of Wu et al. [41]. First, it is essentially limited to small molecules, whereas our main concern is PPIs. Second, as opposed to signal peptides, PPIs are less specific. As a result, a given target protein may interact with multiple binders. Indeed, only 0.13% of proteins studied by Wu et al. interacted with more than one binder. The situation is completely different for PPIs. For instance, in the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) dataset [57], most proteins interact with at least four binders. The distribution of the number of interactions per protein for the STRING dataset is reported in Fig. 2. Therefore, protein design is made more complex by the fact that there are multiple solutions for a given target protein (multivalued problem). This is to be contrasted with signal peptides, for which the solution is essentially unique, which means that the problem is more constrained.

As indicated earlier, all state-of-the-art methods require either knowledge of the 3-D structure of the target protein or domain-specific knowledge. Therefore, they may encounter difficulties when applied to new problems, meaning that their generalizability is rather limited. In addition, very often, the 3-D structure is not readily available. On the other hand, amino acid sequences may be easily obtained. For these reasons, we introduce a solution for designing binder proteins that is based on amino acid sequences. Our approach aims at designing binders solely from amino acid sequences while addressing the multivalued problem inherent to PPIs. The architectures of our models are presented in “Transformer for binding protein prediction given a known target protein”.

Fig. 2
figure 2

Distribution of the number of interactions per protein in the STRING dataset

Dataset and preprocessing

The dataset employed for training and testing is the STRING dataset that was originally proposed by Szklarczyk et al. in 2019 [57]. For the species Homo sapiens, the dataset consists of 11,759,455 interacting protein pairs, each of which is characterized by a binding score. Amino acid sequences consisting of more than 256 amino acids are excluded because the computational requirements for the self-attention mechanism grow quadratically with the number of amino acids [58]. In addition, since we are interested in binding proteins with high affinities, only proteins for which the binding score is between 0.9 and 1.0 are retained [59,60,61,62].
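As an illustration, this filtering step can be sketched as follows in Python; the tuple layout (target sequence, binder sequence, score) and the function name are our own assumptions rather than the actual STRING record format.

```python
# Illustrative sketch of the filtering described above; the (target, binder, score)
# tuple layout is an assumption, not the actual STRING schema.
MAX_LEN = 256                      # self-attention cost grows quadratically with length
SCORE_MIN, SCORE_MAX = 0.9, 1.0    # keep only high-affinity interactions

def filter_pairs(pairs):
    """pairs: iterable of (target_seq, binder_seq, score) tuples."""
    kept = []
    for target, binder, score in pairs:
        if len(target) > MAX_LEN or len(binder) > MAX_LEN:
            continue                               # sequence too long
        if not (SCORE_MIN <= score <= SCORE_MAX):
            continue                               # affinity too low
        kept.append((target, binder, score))
    return kept
```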

Recall that the transformer architecture is asymmetric, whereas the interaction itself is symmetric: the binder may take the role of the target protein and vice versa. Therefore, the dataset was extended by inverting the binder and the target protein. As a result, 96,744 interacting pairs involving 2091 distinct proteins were obtained in conjunction with their binding scores. Since a seq2seq model is employed, the amino acids are assimilated to tokens, while the amino acid sequences are assimilated to sentences. As in NLP, each sequence starts with a BOS token and ends with an EOS token. The BOS token is employed to initialize the recursive prediction at the inference stage, while the EOS token marks the end of the sequence. Therefore, a sequence of arbitrary length may be predicted. Sequences with fewer than 256 amino acids are padded with padding tokens so that shorter amino acid sequences can also be handled.
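The pair inversion and the tokenization described above may be sketched as follows; the token names and the integer vocabulary are illustrative choices, not the exact encoding used in our implementation.

```python
# Hedged sketch of the dataset extension and tokenization; token names are illustrative.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY") + ["X"]      # 20 standard residues + unknown
SPECIALS = ["<pad>", "<bos>", "<eos>"]
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + AMINO_ACIDS)}
MAX_LEN = 256

def extend_with_inverted_pairs(pairs):
    """Add (binder, target, score) for every (target, binder, score)."""
    return pairs + [(b, t, s) for (t, b, s) in pairs]

def encode(seq, max_len=MAX_LEN):
    """Map an amino acid string to a fixed-length list of token ids."""
    tokens = ["<bos>"] + list(seq) + ["<eos>"]
    tokens += ["<pad>"] * (max_len + 2 - len(tokens))    # pad shorter sequences
    return [VOCAB[t] for t in tokens]
```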

For AppendFormer, the binding score is discretized into 100 bins ranging from 0.9 to 1.0, with each bin corresponding to a particular range of binding scores. The number of bins was determined through experimentation. As a result, the vocabulary needed to be increased by 100 tokens in addition to those associated with the amino acids. For MergeFormer, since the binding score is injected directly at the level of the latent space, no additional tokens are required. Network training and the priors are addressed in “Network training”.
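The discretization used by AppendFormer can be sketched as follows; the naming of the additional score tokens is an illustrative assumption.

```python
# Sketch of the score discretization for AppendFormer: 100 bins over [0.9, 1.0],
# each bin mapped to one extra vocabulary token (token names are illustrative).
N_BINS = 100
SCORE_MIN, SCORE_MAX = 0.9, 1.0
SCORE_TOKENS = [f"<score_{i:02d}>" for i in range(N_BINS)]

def score_to_bin(score):
    """Return the index (0..99) of the bin containing the binding score."""
    frac = (score - SCORE_MIN) / (SCORE_MAX - SCORE_MIN)
    return min(int(frac * N_BINS), N_BINS - 1)   # clamp score == 1.0 into the last bin

# AppendFormer's input vocabulary therefore grows by N_BINS tokens; MergeFormer's does not.
```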

Transformer for binding protein prediction given a known target protein

Recall that a transformer is a deep learning architecture that transforms one sequence into another and that it is based on seq2seq. It is an autoregressive model with an attention mechanism that was first proposed for neural machine translation [23]. As opposed to recurrent neural networks [63], no sequential treatment of the input vector is required. In addition, the scaled dot product attention mechanism densely connects the input and output. Therefore, the context vector is linked to the entire input, thus overcoming the problem associated with long-term dependencies [64]. As stated earlier, PPI is a multivalued problem since more than one binder may be associated with a target protein, as reported in Fig. 2. To obtain a particular solution, some prior knowledge must be introduced in order to constrain the solution to a particular binder. To achieve a unique solution, it is proposed to employ the binding scores between target proteins and binders in addition to prior knowledge of the amino acid sequence of the binder. Here, the prior refers to the first amino acids of the binder that are known.

In our study, two transformer architectures are introduced: one is optimized for short priors, while the other is optimized for longer priors. The first architecture, AppendFormer, extends the original transformer introduced by Vaswani et al. [23] with the objective of learning in the presence of short priors. The input to the AppendFormer transformer consists of the amino acid sequence of the target protein and the discretized binding score, which is appended between the padding and the EOS tokens (Fig. 3). The EOS token is employed to determine the localization of the discretized binding score within the input sequence. The architecture of AppendFormer is reported in Fig. 4. The transformer consists of an encoder and a decoder: the encoder transforms the target protein amino acid sequence into a latent representation, while the decoder generates the amino acid sequence of the corresponding binder protein from the latent representation. The decoder has a recurrent architecture, which means that its output is reinjected into its input, as illustrated in Fig. 4. The output of the decoder is initialized with a beginning-of-sequence (BOS) token as well as with a prior about the binder amino acid sequence that consists of its first \(\textit{n}\) amino acids. This prior is employed to constrain the solution to a unique binder. Indeed, as stated earlier, because PPI is essentially a multivalued problem, it is necessary to introduce some constraints to obtain a unique solution.

For both the encoder and the decoder, the amino acid sequences are processed with an embedding layer [23] that ensures that both sequences have a continuous representation while mapping amino acids with similar properties to the same region. As the attention mechanism does not account for the order in which the amino acids appear in the sequences, a positional encoder [23] is added after each embedding layer, thus allowing the transformer to account for the order of the amino acids without jeopardizing parallel processing. The positional layer is followed by a multi-head self-attention layer. The aim of this layer is to capture the interrelations among the various amino acids, such as self-folding and self-interaction, either for the target protein or the binder, as demonstrated in [65]. Then, a multi-head cross-attention layer combines the decoder states with the latent representation: the queries are associated with the decoder, while the keys and values are associated with the encoder, thus allowing the capture of the interrelations between the amino acids of the target protein (receptor) and those of the binder. The multi-head attention layer is followed by a classifier that consists of a feed-forward neural network, a linear layer, and a softmax activation function. The classes correspond to the twenty standard amino acids as well as to a potentially unknown amino acid (multi-class problem). During the training phase, the classifier allows for determining whether the amino acid generated by the decoder is the right one. Various skip connections [66] are added along the way to reduce the complexity of the loss function landscape, therefore facilitating its optimization [67].
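To make this description concrete, the sketch below assembles an AppendFormer-style encoder-decoder on top of a standard transformer implementation. PyTorch is our choice for illustration only; padding masks, weight initialization, and other details of the actual implementation are omitted, and the hyperparameter names follow Table 1.

```python
import math
import torch
import torch.nn as nn

# Compact, illustrative sketch of an AppendFormer-style model; not the authors' code.
class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                      # x: (batch, seq, d_model)
        return x + self.pe[: x.size(1)]

class AppendFormer(nn.Module):
    def __init__(self, v_input, v_output, d_model=512, n_head=8,
                 n_encoder=3, n_decoder=3, d_ffn=1792, p_drop=0.1):
        super().__init__()
        self.src_emb = nn.Embedding(v_input, d_model)    # amino acids + score tokens
        self.tgt_emb = nn.Embedding(v_output, d_model)   # amino acids only
        self.pos = SinusoidalPositionalEncoding(d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_head,
            num_encoder_layers=n_encoder, num_decoder_layers=n_decoder,
            dim_feedforward=d_ffn, dropout=p_drop, batch_first=True)
        self.classifier = nn.Linear(d_model, v_output)   # softmax is applied in the loss

    def forward(self, src, tgt):
        # src already contains the discretized binding-score token (Fig. 3);
        # tgt is the (teacher-forced) binder sequence shifted right.
        causal = self.transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        h = self.transformer(self.pos(self.src_emb(src)),
                             self.pos(self.tgt_emb(tgt)),
                             tgt_mask=causal)
        return self.classifier(h)                        # logits over amino acid classes
```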

All the amino acid sequences employed in the experiments come from the Protein Data Bank. As such, they were determined experimentally and curated. When sequencing the amino acids, if the procedure is inconclusive for a particular amino acid, the latter is considered unknown and labelled as X. These unknown amino acids constitute the noise, or better said, the uncertainty associated with the data. Therefore, the unknown amino acids are also employed in addition to the twenty standard amino acids during training. This allows the transformer not only to handle unknown data, but also to assess if the experimental determination of an amino acid may be potentially challenging when predicting new sequences.

We also introduce the MergeFormer architecture, designed for longer priors, i.e., a scenario in which a larger portion of the binder is known. In this second architecture, a merge layer is introduced between the encoder and the decoder to inject the binding score deeper into the network, just after the feed-forward neural network of the encoder. Therefore, the binding score is decoupled from the amino acid sequence, as opposed to the first architecture. The MergeFormer architecture is described in Fig. 5. The input of the encoder consists of the amino acid sequence of the target protein. The network is designed in such a way that the effect of the binding score may be learned independently for each amino acid. Therefore, a constant vector is created, the constant being the binding score, which has the same dimension as the latent or hidden representation. This vector is concatenated to the latent representation, the resulting vector is transposed, and a linear transformation is applied, thus allowing for the learning of the interrelations between the binding score and the amino acids. Finally, the linearly transformed representation is transposed to restore the original topology (a sketch of this merge layer is given after the list below). The main advantages of MergeFormer are:

  • No vocabulary is required for the binding affinity, thus reducing the number of tokens that must be learned by the network.

  • For AppendFormer, the discretized binding score is appended at the end of the protein sequence. Since the amino acid sequence may have an arbitrary length, the position of the binding score is unknown and therefore must be learned by the network. In the case of MergeFormer, the binding score is injected directly into the latent space while its relation to each amino acid is learned.

  • The fact that the binding score is independent of the amino acid sequence allows for initializing the encoder weights with techniques such as masked language modeling (MLM), which has been shown to substantially improve performances in NLP when employed in conjunction with the bidirectional encoder representations from transformers (BERT) [68].
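The sketch below reflects our reading of the merge layer described above: the binding score is broadcast to a constant vector of dimension \(d_{model}\), concatenated to the encoder output along the sequence axis, and a linear map applied to the transposed tensor learns the interplay between the score and each position before the original layout is restored. The shapes assume sequences padded to a fixed length; this is an illustration, not the reference implementation.

```python
import torch
import torch.nn as nn

# Illustrative merge layer; the exact tensor layout is our assumption.
class MergeLayer(nn.Module):
    def __init__(self, seq_len):
        super().__init__()
        # maps (seq_len + 1) positions (sequence + score slot) back to seq_len positions
        self.mix = nn.Linear(seq_len + 1, seq_len)

    def forward(self, memory, score):
        # memory: (batch, seq_len, d_model) encoder output; score: (batch,) binding scores
        score_vec = score.view(-1, 1, 1).expand(-1, 1, memory.size(-1))
        merged = torch.cat([memory, score_vec], dim=1)   # (batch, seq_len + 1, d_model)
        merged = merged.transpose(1, 2)                  # (batch, d_model, seq_len + 1)
        merged = self.mix(merged)                        # learn score/position interplay
        return merged.transpose(1, 2)                    # (batch, seq_len, d_model)
```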

Fig. 3
figure 3

Example input for encoder and decoder in AppendFormer

Table 1 Hyperparameters associated with the transformers: \(n_{encoder}\) and \(n_{decoder}\) represent the number of layers in the encoder and the decoder, respectively, \(d_{model}\) and \(d_{ffn}\) are the dimensions of the embedding and feed-forward neural network, respectively, and \(P_{drop}\) is the dropout probability. The number of heads in the multi-head attention mechanism is given by \(n_{head}\), while the sizes of the input and output vocabulary are represented by \(v_{input}\) and \(v_{output}\), respectively. For MergeFormer, these are equal, as they correspond to the sum of the number of amino acids, the number of special amino acids, and the number of special tokens. For AppendFormer, \(v_{input}\) is larger by the 100 binding-score tokens
Fig. 4
figure 4

Architecture of AppendFormer. The discretized binding score BS is inserted at the end of the sequence. The number N of encoder layers was 3 in this study

Fig. 5
figure 5

Architecture of MergeFormer. The input is the amino acid sequence of the target protein, while the binding score is concatenated with the latent representation. \(N=3\) in this study

The hyperparameters employed for both transformers are reported in Table 1. The table covers four transformers, namely one MergeFormer architecture and three AppendFormer architectures, the latter differing only in the number of neurons in the feed-forward neural network.

To further evaluate the proposed transformers, their performances are compared against a seq2seq baseline with LSTM layers [22]. We employ this model as a baseline since it famously surpassed the state of the art for machine translation in the Conference on Machine Translation (WMT’14) English to French Challenge [69]. To date, it remains one of the most widely used seq2seq models and has been successful in a wide range of applications [70,71,72]. Specifically, it has also been successfully adopted in protein design applications [73,74,75].

Its architecture is described in Fig. 6. As with AppendFormer and MergeFormer, the baseline model consists of an encoder and a decoder. The key difference is that the attention layers are replaced by LSTM layers. The number of neurons in the LSTM layers was set to 512 by inspection. Like AppendFormer and MergeFormer, the baseline model possesses embedding layers that convert the amino acid sequences into hidden or latent representations, which are subsequently processed by the LSTM layers. The LSTM layer associated with the decoder is initialized from the internal states of the encoder’s LSTM layer. These internal states consist of the cell state and the hidden state. The decoder’s LSTM autoregressively generates the amino acid sequence of the binder, given the target protein. As in AppendFormer and MergeFormer, the amino acids are predicted with a linear layer employing a softmax activation function.
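A minimal version of this baseline can be sketched as follows; PyTorch is used only for consistency with the earlier sketches, and the layer sizes follow the text.

```python
import torch
import torch.nn as nn

# Illustrative seq2seq LSTM baseline in the spirit of Fig. 6.
class Seq2SeqBaseline(nn.Module):
    def __init__(self, vocab_size, d_model=512, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(vocab_size, d_model)
        self.tgt_emb = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.LSTM(d_model, hidden, batch_first=True)
        self.decoder = nn.LSTM(d_model, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, vocab_size)

    def forward(self, src, tgt):
        # The decoder LSTM is initialized from the encoder's final hidden and cell states.
        _, (h, c) = self.encoder(self.src_emb(src))
        out, _ = self.decoder(self.tgt_emb(tgt), (h, c))
        return self.classifier(out)              # logits over the amino acid classes
```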

Fig. 6
figure 6

Architecture of the seq2seq baseline model

Network training

Teacher forcing, which is an essential part of many seq2seq architectures [76], is employed at the training stage. Being an autoregressive model, a transformer predicts the next amino acid from the latent representation of the input sequence (target protein) and the previous predictions of the decoder. With teacher forcing, during training, the decoder is fed the ground-truth amino acid at position t rather than its own, possibly erroneous, prediction. As indicated earlier, given a target protein, the solution in terms of binders is multivalued. Therefore, due to the multiplicity of solutions, an error may be made when predicting an amino acid. The model being autoregressive, this error may jeopardize subsequent predictions in addition to causing error accumulation. Teacher forcing avoids the accumulation of errors by providing the decoder with the true input, resulting in faster convergence during the training phase.

With teacher forcing, the training may be drastically accelerated. Without teacher forcing, the predictions have to be made recursively because the decoder’s output is reinjected into the decoder’s input. This means that the prediction for the current position is always based on the previous one. The recursive process eliminates any possibility for parallel training. However, with teacher forcing, the decoder is provided with ground truth labels rather than the model’s previous predictions. Therefore, the sequence prediction may be partitioned into multiple independent sub-tasks (parallelization). These sub-tasks can be distributed to many nodes and calculated concurrently, thus accelerating the training process. The loss function is optimized with the Adaptive Moment Estimation (ADAM) stochastic optimizer technique [77].

For the loss function, with teacher forcing training, each amino acid prediction along the chain may be performed independently, therefore resulting in independent multiclass classification problems, with the classes corresponding to the amino acid types. It follows that the loss function is a sparse categorical cross entropy [78], described by Eq. 1, where w denotes the weights of the model, and the output of the model and the ground-truth label are denoted by \({\widehat{y}}\) and y, respectively. As in [23], the padding tokens are ignored during training.

$$\begin{aligned} J(w)= -\frac{1}{N}\sum _{i=1}^{N}y_{i}\textrm{log}({\widehat{y}}_i) \end{aligned}$$
(1)
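The combination of teacher forcing and the loss of Eq. 1 can be summarized by the following sketch of a single training step: the decoder receives the ground-truth binder shifted right so that all positions are predicted in parallel, and padding positions are excluded from the loss. The padding-token id and the framework are assumptions made for illustration.

```python
import torch.nn as nn

# Illustrative teacher-forced training step; PAD_ID is an assumed padding-token id.
PAD_ID = 0
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)   # padding tokens ignored (Eq. 1)

def training_step(model, optimizer, src, binder):
    """src: encoder input ids; binder: ground-truth binder ids (with BOS/EOS/padding)."""
    decoder_in = binder[:, :-1]          # teacher forcing: ground truth as decoder input
    target = binder[:, 1:]               # next-token labels
    logits = model(src, decoder_in)      # (batch, seq, vocab)
    loss = criterion(logits.reshape(-1, logits.size(-1)), target.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                     # ADAM update
    return loss.item()
```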

As for the original transformer, a dynamic learning rate with warm-up and decay is employed, as it improves convergence during training. We use the same schedule for our AppendFormer model, since it is similar to the original transformer architecture, as shown in Eq. 2. For our MergeFormer model, we design a dynamic learning rate with a warm-up spike followed by a decay toward a constant lower bound, as described by Eq. 3.

$$l_{\textrm{appe}} = d_{\textrm{model}}^{-0.5} \cdot \min \left( n_{\textrm{s}}^{-0.5},\ n_{\textrm{s}} \cdot n_{\textrm{ws}}^{-1.5} \right)$$
(2)
$$l_{\textrm{merg}} = \begin{cases} l_{\textrm{init}} \cdot \frac{n_{\textrm{s}}}{n_{\textrm{ws}}} & n_{\textrm{s}} \leqslant n_{\textrm{ws}} \\ \max \left( l_{\textrm{init}} \cdot \frac{1}{1 + \alpha \cdot (n_{\textrm{s}} - n_{\textrm{ws}})},\ \epsilon \right) & n_{\textrm{s}} > n_{\textrm{ws}} \end{cases}$$
(3)

where \(n_{\textrm{s}}\) represents the number of training steps, \(n_{\textrm{ws}}\) the number of warm-up steps, \(\epsilon \) the lower boundary, \(\alpha \) the learning rate decay speed, and \(l_{\textrm{init}}\) the initial learning rate.

The learning rate for MergeFormer is slightly different. To avoid local minima, both the lower boundary and the initial learning rate must be smaller [79]. The learning rates for both AppendFormer and MergeFormer are reported in Fig. 7.
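Both schedules can be transcribed directly from Eqs. 2 and 3; the warm-up length, \(\alpha \), \(\epsilon \), and \(l_{\textrm{init}}\) values below are placeholders rather than the values used in our experiments.

```python
# Direct transcriptions of Eqs. (2) and (3); the default constants are placeholders.
def lr_appendformer(step, d_model=512, warmup=4000):
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

def lr_mergeformer(step, l_init=1e-4, warmup=4000, alpha=1e-4, eps=1e-6):
    if step <= warmup:
        return l_init * step / warmup                            # warm-up spike
    return max(l_init / (1.0 + alpha * (step - warmup)), eps)    # decay down to the floor
```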

Fig. 7
figure 7

Dynamic learning rates for AppendFormer and MergeFormer

Inference, evaluation, and prior

Recall that both AppendFormer and MergeFormer are autoregressive models, which means that the prediction of a new amino acid is based on the previous ones. The encoder generates a latent representation from the target amino acid sequence and the binding score, the latter being appended to the sequence in AppendFormer or injected into the latent space in the case of MergeFormer. The latent representation acts as the keys and values for the multi-head attention mechanism associated with the decoder.

Initially, when the amino acid sequence of the binder is completely unknown, the sequence is initialized with a BOS token. Then, the decoder predicts the first amino acid. Being a multiclass classifier, the decoder predicts an occurrence probability for each of the twenty amino acid types or classes. The amino acid with the highest probability is selected with a greedy search strategy [80]. The output sequence is updated with the outcome of the greedy search, and the recursion is repeated until an EOS token is generated by the decoder, allowing for the prediction of sequences of arbitrary length.
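The greedy decoding loop can be sketched as follows; the token ids, the maximum length, and the single-sequence batch are illustrative simplifications.

```python
import torch

# Illustrative greedy autoregressive decoding; token ids are assumptions.
BOS_ID, EOS_ID = 1, 2
MAX_LEN = 258

@torch.no_grad()
def greedy_decode(model, src):
    seq = torch.tensor([[BOS_ID]])                   # decoder input, batch of one
    for _ in range(MAX_LEN):
        logits = model(src, seq)                     # (1, len(seq), vocab)
        next_tok = logits[0, -1].argmax().item()     # highest-probability amino acid
        seq = torch.cat([seq, torch.tensor([[next_tok]])], dim=1)
        if next_tok == EOS_ID:
            break
    return seq[0, 1:]                                # predicted binder, BOS stripped
```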

The greedy search algorithm only retains the amino acid with the highest classification probability. Because there is more than one binder for a given target, many correct solutions are possible, at least at the beginning of the prediction process. The real and unique solution may only be inferred as more amino acids are predicted along the way. This problem may be overcome with beam search [81]. Instead of retaining only the single most probable amino acid at each step, beam search retains the \(b_w\) most probable partial sequences, the probability of each newly predicted amino acid being conditional on the previous ones. For instance, for \(b_w = 2\), the probabilities associated with two consecutive amino acids (a bi-gram) are multiplied to obtain the conditional probability of the bi-gram, with the probability of the second amino acid being conditional upon the first one. Then, the bi-gram with the highest conditional probability is selected, and the process is repeated for each new prediction. As a result, the predictions are improved, and the issues associated with the multivalued problem are mitigated.

To further constrain the multivalued problem, a prior is attributed to the binder. Such priors are common in protein design, especially when mutation-based techniques are employed [35,36,37,38,39,40]. Their application is based on the observation that, without the prior, the transformer would not have enough information, or constraints, to converge onto a particular solution [82]. In our work, a prior consists of the first n amino acids forming the binder for both AppendFormer and MergeFormer. In the first case, the prior is appended to the input of the AppendFormer, while, in the second case, the prior is injected directly into the latent space between the encoder and the decoder. Depending on its length, the prior may be assimilated to a peptide or a polypeptide. The prior may also be viewed as a form of teacher forcing. Indeed, for the first n amino acids of the binder, the amino acids generated by the decoder are replaced by the corresponding amino acids of the prior. From the \((n + 1)\)-th prediction onward, the amino acids are directly predicted by the decoder with beam search. Additionally, at the prediction stage, teacher forcing is only applied within the prior section, whereas during training it is applied throughout the entire sequence.
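The following sketch summarizes beam search with a prior (BSP) as described above: the beam is seeded with the BOS token and the first n known amino acids of the binder, and decoding then keeps the \(b_w\) partial sequences with the highest cumulative log-probability. Length normalization and batching are omitted; this is an illustration rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative beam search with prior (BSP); token ids and maximum length are assumptions.
BOS_ID, EOS_ID = 1, 2
MAX_LEN = 258

@torch.no_grad()
def beam_search_with_prior(model, src, prior_ids, b_w=2):
    seed = torch.tensor([[BOS_ID] + list(prior_ids)])       # BOS + known prefix of the binder
    beams = [(seed, 0.0)]                                    # (sequence, cumulative log-prob)
    for _ in range(MAX_LEN - seed.size(1)):
        candidates = []
        for seq, score in beams:
            if seq[0, -1].item() == EOS_ID:                  # finished beams are kept as-is
                candidates.append((seq, score))
                continue
            log_p = F.log_softmax(model(src, seq)[0, -1], dim=-1)
            top_p, top_ids = log_p.topk(b_w)
            for p, tok in zip(top_p.tolist(), top_ids.tolist()):
                ext = torch.cat([seq, torch.tensor([[tok]])], dim=1)
                candidates.append((ext, score + p))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:b_w]
        if all(seq[0, -1].item() == EOS_ID for seq, _ in beams):
            break
    return beams[0][0][0, 1:]                                # best sequence, BOS stripped
```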

Because the ground truth is known for the prior, it is not involved in the evaluation. Therefore, all the evaluation metrics ignore the prior and only focus on the amino acid directly generated by the decoder, thus avoiding the introduction of bias in the evaluation procedure. Five metrics are employed for the evaluation, namely the token matching rate (TMR), bilingual evaluation understudy (BLEU), Needleman-Wunsch, Smith-Waterman metrics, and the Metric for Evaluation of Translation with Explicit ORdering (METEOR).

  • Token matching rate: The token matching rate is evaluated over the whole test set. It corresponds to the ratio of the number of correctly predicted tokens (amino acids) to the total number of tokens. The metric ranges between 0 and 1 (a sketch of this metric and of the Needleman-Wunsch similarity is given after this list).

    $$\begin{aligned} TMR = \frac{n_{\mathrm {token\_match}}}{n_{\textrm{token}}} \end{aligned}$$
    (4)
  • BLEU metric: The BLEU score [83] is a metric that was originally developed for evaluating the quality of texts that have been machine-translated from one natural language to another. In this study, the BLEU score is based on the geometric mean of the modified n-gram precisions for the amino acid prediction, with an n-gram corresponding to a contiguous sequence of amino acids of length n, multiplied by a brevity penalty that penalizes overly short predicted sequences [83]. Here, the value of \(\textit{n}\) ranges between 1 and 4, while the BLEU score lies between 0 and 1. The metric depends only on the n-grams and not on their positions within the sequence. Therefore, the BLEU score determines the similarity between the true and predicted sequences in terms of their n-grams. The BLEU score is also related to the protein structure quality. Indeed, the number of amino acid substitutions is significantly correlated with neighborhood preference among amino acids [84]. As reported by [85], the neighborhood preference may be employed to assess structural quality.

  • Needleman-Wunsch metric: The Needleman-Wunsch algorithm determines the best global alignment and similarity measure between two protein sequences with dynamic programming [86]. Given an amino acid sequence, each correctly predicted amino acid is attributed a reward of 1, while a penalty of 0 is attributed otherwise. The similarity measure is the ratio between the sum of the rewards and penalties and the total number of amino acids. This is a number between 0 and 1 whose value is independent of the number of amino acids forming the sequence.

  • Smith-Waterman metric: The Smith-Waterman algorithm [87] is based on local alignment, which compares amino acid segments of all possible lengths and optimizes the similarity measure with dynamic programming. The measure itself is evaluated in the same way as for the Needleman-Wunsch metric. It is a number between 0 and 1 whose value is independent of the number of amino acids forming the sequence.

  • METEOR: Like the other metrics, METEOR measures the similarity between the generated and the targeted sequences. The main difference with the other metrics is that METEOR first aligns the two sequences and then calculates the harmonic mean of the unigram precision and recall, while putting more emphasis on recall [88]. From an information retrieval point of view, it enables stemming and synonymy matching, which is not possible with the other metrics [88].
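As announced above, the sketch below illustrates the token matching rate of Eq. 4 and a Needleman-Wunsch-style global alignment similarity under the simple scoring described above (match reward 1, mismatch and gap 0, normalization by sequence length). In practice, established libraries such as Biopython provide equivalent alignment routines.

```python
# Illustrative implementations of two of the metrics; scoring choices follow the text.
def needleman_wunsch_similarity(pred, true):
    """Global alignment with match = 1 and mismatch/gap = 0, normalized to [0, 1]."""
    n, m = len(pred), len(true)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + (1 if pred[i - 1] == true[j - 1] else 0)
            dp[i][j] = max(match, dp[i - 1][j], dp[i][j - 1])   # gaps cost nothing
    return dp[n][m] / max(n, m)          # bounded in [0, 1] regardless of sequence length

def token_matching_rate(pred_seqs, true_seqs):
    """Eq. (4): correctly predicted tokens over all tokens of the test set."""
    matches = sum(p == t for pred, true in zip(pred_seqs, true_seqs)
                  for p, t in zip(pred, true))
    total = sum(len(t) for t in true_seqs)
    return matches / total
```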

Table 2 Number of trainable parameters, 5-fold cross-validation accuracy and top-5 accuracy with teacher forcing

To further ascertain the performances of the transformers, the classification accuracy and the top-5 classification accuracy are evaluated. Borrowed from classification tasks, top-5 accuracy means that the true amino acid is among the five predictions with the highest probability.

The hyperparameters of the transformers were determined by evaluating and comparing the performances of the models for various values of the hyperparameters. As stated earlier, without teacher forcing, the sequences are evaluated recursively, meaning that the training process cannot be parallelized, which makes the evaluation of the hyperparameters computationally prohibitive. For this reason, as explained in “Network training”, teacher forcing was employed to train each model. Indeed, teacher forcing allows the prediction of each amino acid independently of the others, thus allowing for parallelization of the calculations, which reduces the time required for training [89].

Our experimental results are presented in the next section.

Experimental results

Fig. 8
figure 8

Evaluation of the performances of AppendFormer, MergeFormer, and the seq2seq baseline model according to the importance of the prior for five metrics: the token matching rate, BLEU, Needleman-Wunsch, Smith-Waterman, and METEOR

The best models and hyperparameters were determined from the accuracy and the top-5 accuracy as obtained by 5-fold cross-validation with teacher forcing. Recall that teacher forcing was employed to parallelize the amino acid predictions, thus allowing for faster convergence. The best performances for AppendFormer (AppendFormer-l) were obtained with a feed-forward neural network consisting of 1792 neurons. The number of parameters, the accuracy, and the top-5 accuracy for each model are reported in Table 2. While MergeFormer and AppendFormer-l have comparable performances in terms of accuracy, MergeFormer has far fewer parameters than AppendFormer-l, thus reducing the risk of overfitting while facilitating convergence during the training phase. AppendFormer-l has slightly better performance in terms of top-5 accuracy than MergeFormer, while the situation is the opposite for accuracy.

The two models that obtained the best accuracy, AppendFormer-l and MergeFormer, and the baseline seq2seq model were evaluated with the token matching rate, BLEU, Needleman-Wunsch, Smith-Waterman, and METEOR metrics. It should be noted that teacher forcing was not employed during the testing phase, as it would bias the results by providing the network with the correct solution. These metrics were evaluated for different priors on the binding proteins. The priors were obtained by providing the decoder with a certain percentage of the first amino acids forming the binder. This percentage was restricted to between 0% and 30% of the total length. As stated earlier, this prior is required in order to constrain the multivalued problem, as multiple binders may bind to a given target protein. Therefore, the prior ensures that the right binder is found for a particular interaction.

Fig. 9
figure 9

Evaluation of the performances of AppendFormer, MergeFormer, and the seq2seq baseline model according to the importance of the prior for five metrics, namely the token matching rate, BLEU, Needleman-Wunsch, Smith-Waterman, and METEOR, with the C2 evaluation procedure

The results for the five metrics are reported in Fig. 8. First, we turn our attention to the Needleman-Wunsch and Smith-Waterman metrics, since they are widely used in bioinformatics. The best performances in terms of the Needleman-Wunsch and Smith-Waterman metrics are obtained with AppendFormer-l when no prior is employed, the metrics being equal to 0.58. This value increases to 0.61 for AppendFormer-l with a 5% prior, 0.86 with a 10% prior (both for AppendFormer-l and MergeFormer), and up to 0.95 with a 30% prior with MergeFormer. AppendFormer-l slightly outperforms MergeFormer when the prior is smaller than 10%, while MergeFormer strongly outperforms AppendFormer-l otherwise, improving performances by up to 40% when a 30% prior is employed. In comparison, as illustrated by Fig. 8, the seq2seq baseline model has relatively poor performances with the Needleman-Wunsch and Smith-Waterman metrics, barely reaching 0.65 when a 30% prior is employed. These results show that AppendFormer is suited to short priors (less than 10% of the entire sequence), while MergeFormer performs better with longer priors (more than 10%), thus confirming the validity of our architectures. Both transformers outperform the baseline seq2seq model. The performances in terms of the Needleman-Wunsch and Smith-Waterman metrics are quite similar, meaning that no local alignment is required and, thus, that the sequence is properly predicted up to a global alignment. The same general behavior is observed for the token matching rate, BLEU, and METEOR results. Although these three metrics are reported for completeness, as stated above, the Needleman-Wunsch and Smith-Waterman metrics are the most relevant from a bioinformatics perspective. In contrast, researchers familiar with NLP evaluation metrics use the token matching, BLEU, and METEOR scores more often.

Park and Marcotte [90] and Hamp and Rost [91] propose three evaluation procedures: C1, C2, and C3. In C1, it is assumed that some ligands and receptors appear in both the training and validation sets; in C2, only some receptors appear in both groups; and in C3, no receptors are present in the validation set. The C1 procedure corresponds to the original experimental setup. Regarding the C2 procedure, we created a new dataset and conducted another round of experiments on it. This dataset was created by removing protein pairs for which the ligand was present in both the training and validation sets. As reported in Fig. 9, the obtained results are comparable with those achieved with C1. For instance, for the bioinformatics metrics (Needleman-Wunsch and Smith-Waterman), a value of 0.86 was obtained, assuming a 10% prior on the ligand. In comparison, a value of 0.88 is reached with the C1 counterpart (Fig. 8). Besides, the best results for C1 are achieved with MergeFormer, while AppendFormer performs better with C2. That may be explained by the fact that the learnable alignment procedure associated with MergeFormer is more challenging when none of the ligands is present in both sets. Such an alignment procedure is absent from the AppendFormer model. The C3 evaluation procedure is not an appropriate choice in our experimental setting: in our work, the prior becomes challenging to exploit if none of the receptors are present in both the training and validation sets. The reader should notice that the need for a prior corresponds to real-world settings. The receptor, which usually corresponds to the disease (e.g., the SARS-CoV-2 spike protein), is known in practice. In contrast, the ligand, which corresponds to a new therapeutic protein, is either unknown or partially known. Therefore, some receptors may appear in both sets without compromising the prediction of new therapeutic proteins.

Indeed, these results further highlight the successes of AppendFormer and MergeFormer.
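As an illustration, a C2-style split can be constructed as follows under the reading given above, with receptors allowed to appear in both sets while each ligand is confined to a single set; the pair layout and the deterministic split rule are illustrative assumptions rather than the exact procedure used in our experiments.

```python
# Hedged sketch of a C2-style split: ligands do not overlap between the two sets.
def make_c2_split(pairs, val_fraction=0.2):
    """pairs: iterable of (receptor_seq, ligand_seq, score) tuples."""
    ligands = sorted({ligand for _, ligand, _ in pairs})
    val_ligands = set(ligands[: int(len(ligands) * val_fraction)])   # deterministic choice
    train = [p for p in pairs if p[1] not in val_ligands]
    val = [p for p in pairs if p[1] in val_ligands]
    return train, val
```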

Conclusions

This paper introduced novel deep learning architectures for therapeutic antibody and drug design based solely on amino acid sequences. A new primary-sequence-based approach for predicting the amino acid sequence of the binder corresponding to a given target protein amino acid sequence was introduced. As a target protein may interact with more than one protein, the problem was constrained by employing the binding score and a prior on the binder. The binder was predicted using the transformer deep learning approach. Two architectures were created: a transformer in which the discretized binding score is appended to the target amino acid sequence and a transformer for which the binding score is injected directly into the latent space between the encoder and the decoder, resulting in fewer parameters and a learnable relation between the binding score and the latent space. To improve convergence and parallelize the calculations, training was performed with teacher forcing. The best results were obtained with AppendFormer-l when the prior was shorter than 10% of the binder sequence, and with MergeFormer for longer priors.

A limitation of the proposed approach is its reliance on the prior. Indeed, the receptor (target protein) sequence and a portion of the ligand (binder protein) sequence must be known to predict the unknown part. This is often the case in practice but may become more challenging when little information is available. In addition, the prior restricts the number of solutions to one. Sometimes, a plurality of solutions is desirable, especially when toxicity issues arise with some generated proteins.

In the future, it is proposed to generate binders with a generative adversarial network (GAN) [92] in which the generator would be a transformer and the discriminator would be a pyramid neural network [65]. The discriminator would determine whether the target and binder proteins interact or not. In this extended framework, the transformer would not require any prior information. Rather, all the binders would be generated directly by the GAN, thus overcoming the problems associated with multivalued solutions in addition to providing as many solutions as possible.

As for most transformer-based architectures, the computational complexity bottleneck comes from the self-attention mechanisms, which grow quadratically with the number of amino acids [58]. This is why the number of amino acids was limited to 256. In future work, it is proposed to employ attention mechanisms with linear complexity, such as those described in [93] and [94], to handle very long sequences.