Introduction

Since the outbreaks of severe acute respiratory syndrome in 2003 and Middle East respiratory syndrome in 2012, caused by SARS-CoV and MERS-CoV, respectively, viruses from the genus Coronavirus have been in the focus of the scientific community [1,2,3,4,5,6]. The appearance of a new member of the Coronaviridae family, SARS-CoV-2, in December 2019 in Wuhan, Hubei province, China, caused a global pandemic with significant effects on the health, social, economic, and environmental domains [7,8,9,10]. Recently, the first efficient vaccines against COVID-19, the disease caused by SARS-CoV-2, have become available [11,12,13]. Nevertheless, the FDA has approved only one medicine, the antiviral drug remdesivir, for the treatment of COVID-19 requiring hospitalization [14]. The design and development of anti-SARS-CoV-2 drugs is therefore a pressing topic in today’s science and pharmaceutical industry. 3C-like protease (3CLpro), papain-like protease (PLpro), nonstructural protein 12 (nsp12) and RNA-dependent RNA polymerase (RdRp) have been selected as the main potential drug targets [15, 16].

The 3CLpro sequences of SARS-CoV-2 and SARS-CoV are highly similar (96%) [17]. 3CLpro is a homodimeric enzyme with an essential role in viral replication and transcription [18]. In a dimer, only one protomer demonstrates catalytic activity [19, 20]. With 306 residues, the protomer’s three-dimensional structure is usually divided into three domains (Fig. 1) [21, 22]. An antiparallel β-barrel is the main secondary structure motif of domains I (residues 8–101) and II (residues 102–184). In contrast, five α-helices form the compact antiparallel globular domain III (residues 201–303), connected to domain II by a long linker loop (residues 185–200). The first seven residues at the N-terminus form the N-finger, which is believed to play a significant role both in dimerization and in establishing proteolytic activity [23, 24]. At the heart of the binding site, located in a cleft between domains I and II, is a conserved histidine-cysteine catalytic dyad. Here, Cys145 acts as the nucleophile in the first step of proteolysis, while His41 is the base catalyst [25]. The existence of allosteric binding sites in the groove between domains II and III has been proposed, with the possibility of stopping dimerization and preventing enzyme maturation [26, 27].

Fig. 1 The structure of SARS-CoV-2 3CLpro in homodimeric form (left) and two perspectives of the monomer (right). The N-finger (residues 1–7) is depicted in dark blue, domain I (residues 8–101) in cornflower blue, domain II (residues 102–184) in orange, the loop region (residues 185–200) in green, domain III (residues 201–306) in red

The beginning of the global COVID-19 pandemic required an immediate response, and the initial idea was to repurpose already approved drugs to fight SARS-CoV-2. Chiou et al. [28] screened 774 FDA-approved drugs for inhibition of 3CLpro activity. Ethacrynic acid, naproxen, and allopurinol were the most potent SARS-CoV-2 3CLpro inhibitors, with IC50 values below 5 μM. A combined docking and MM-PBSA study [29] of six available anti-HIV drugs, which act as HIV-1 protease inhibitors, identified indinavir and darunavir as potential anti-COVID-19 drugs. At the same time, clinical trials of the lopinavir–ritonavir combination have shown it to be ineffective for the treatment of severe COVID-19 cases [30]. Alamri et al. [31] combined pharmacoinformatics with molecular dynamics studies to reveal potential covalent inhibitors capable of binding to the thiol group of Cys145. Additionally, they proposed paritaprevir and simeprevir, anti-hepatitis C virus drugs acting as NS3/4A serine protease inhibitors, as the best hits from the list of FDA-approved drugs for clinical trials against COVID-19. Khan et al. [32] performed docking and molecular dynamics simulations to propose paritaprevir and raltegravir as lead candidates for inhibition of SARS-CoV-2 3CLpro, and dolutegravir and bictegravir for inhibition of 2′-O-ribose methyltransferase. An alternative approach seeks inspiration from nature, from microbial natural products [26] to phytocompounds [33,34,35,36], to pinpoint active compounds useful for the treatment of COVID-19 patients.

Structural and evolutionary analysis of the SARS-CoV and SARS-CoV-2 main proteases indicated that designing novel inhibitors or repurposing existing ones might be challenging [37]. Detailed molecular dynamics simulations of 3CLpro revealed several difficulties one has to be aware of when designing 3CLpro inhibitors. Although the SARS-CoV and SARS-CoV-2 main proteases differ by only 12 residues, located mostly on the enzymes’ surface, both the shape and the size of the binding site change significantly because of its flexibility and plasticity. The encouraging result of that study is the identification of a small number of residues with a significant contribution to protein stability, a potential target for a new class of inhibitors. Recently, Chen et al. [38] gave an overview of potential inhibitors of the SARS-CoV, MERS-CoV, and SARS-CoV-2 main proteases active in the low-μM and sub-μM range. Because peptide-like 3CLpro inhibitors lack antiviral activity in animals owing to interactions with host proteins, they suggested small-molecule inhibitors with higher solubility and lower cytotoxicity. In the same vein, Zhang et al. [39] reported a series of noncovalent 3CLpro inhibitors with 20 nM potency.

A great variety of artificial intelligence and machine learning methods have been exploited to repurpose existing drugs or to design new and effective anti-COVID-19 drugs [40,41,42,43,44,45]. Nand et al. [40] trained a decision stump machine learning predictive model to reduce a dataset of 1528 anti-HIV compounds to 356 compounds with strong bioactivity against 3CLpro. Then, in a series of steps that included a Lipinski’s rule of five filter, molecular docking, application of a deep learning model, structure–activity relationship analysis, and molecular dynamics simulations, two molecules, CID-230119 and CID-948801, were identified as hit compounds. Mohapatra et al. [45] designed a machine learning model based on the Naïve Bayes algorithm with 73% accuracy and, among FDA-approved drugs, found the antiviral drug amprenavir to be the most effective for the treatment of COVID-19. Besides repurposing anti-HIV drugs against 3CLpro, Khan et al. [46] performed virtual screening of a Traditional Chinese Medicine database. For the top hit compounds ranked by docking score, they performed 100 ns molecular dynamics simulations. Based on the analysis of RMSD, RMSF, and binding free energy (estimated using the MM/GBSA approach), saquinavir and TCM5280805 emerged as the compounds with the highest inhibitory potential within the screened database. Kumar and Roy developed a multiple linear regression (MLR) model and identified structural descriptors that increase or decrease the inhibitory potential [47]. Janairo et al. [48] built MLR, support vector regression (SVR), classification and regression trees (CART), random forest, and artificial neural network (ANN) models predicting binding free energies and compared their performance. Exploiting five topological descriptors, the MLR model achieved the best score, with R² of 0.81. Among several developed machine learning models, Kumari and Subbarao pointed to the convolutional neural network (CNN) as the most potent in the binary classification of anti-SARS molecules [49].

In this paper, we calculated the recently introduced radial distribution function (RDF) weighted by the number of valence shell electrons [50,51,52] for 159 experimentally determined structures of SARS-CoV-2 3CLpro complexed with different ligands. The structural advantages of the RDF, namely that it unambiguously describes 3D structures, is independent of the size of a molecule, and is invariant to translation and rotation of a molecule, are complemented with electronic properties characteristic of each atom in a molecule. After the design and careful validation of a model capable of predicting bioactivity against SARS-CoV-2 3CLpro, the activity was predicted for 6407 FDA-approved and experimental compounds, revealing potential inhibitors of the main protease within the DrugBank database [53, 54].

Methods

Radial distribution function

The Protein Data Bank [55, 56] was queried for SARS-CoV-2 3CLpro (the main protease) structures with small molecules (ligands) bound in the active site. Our in-house software was used to manipulate the downloaded pdb files. All water molecules, ions, additives, and small molecules outside the active site were removed. Since we are interested in estimating the similarity of the enzyme itself, we also deleted the ligands. If the experimental structure was resolved for a homodimeric complex, only chain ‘A’ was retained. When several structures contained the same ligand, the structure with the better resolution entered our dataset. Structures with more than two missing residues were not considered. For all main proteases in our dataset, the radial distribution function (RDF) weighted by the number of valence shell electrons, g(r), was calculated.

Briefly, the RDF vector, whose size is defined by the distance of its two most distant atoms (rMAX), represents each structure [50, 57]. The elements of the vector are g(r) values, calculated in 0.1 Å intervals:

$$ g\left( r \right) = \mathop \sum \limits_{i = 1}^{N - 1} \mathop \sum \limits_{j > i}^{N} p_{i} p_{j} e^{{ - a_{ij}^{ - 2/3} \left( {r_{ij} - r} \right)^{2} }} $$
(1)

Here, \(a_{ij}\) is the sum of the atomic polarizabilities of atoms i and j, \(r_{ij}\) is the distance between atoms i and j, N is the number of atoms in a molecule, and the preexponential factors \(p_{i}\) and \(p_{j}\) account for the number of outer electrons of the i-th and j-th atoms, respectively. If we define the average RDF, \(g^{{{\text{avg}}}} \left( r \right)\), as

$$ g^{avg} \left( r \right) = \frac{1}{n}\mathop \sum \limits_{A = 1}^{n} g_{A} \left( r \right) $$
(2)

where n is the number of structures in the dataset, then the similarity index, \(\sigma_{A}\), can be estimated as

$$ \sigma_{A} = \frac{\int_{0}^{r_{\text{MAX}}} \min \left( g_{A} \left( r \right), g^{\text{avg}} \left( r \right) \right) {\text{d}}r}{\int_{0}^{r_{\text{MAX}}} g^{\text{avg}} \left( r \right) {\text{d}}r} $$
(3)

In that case, the similarity between RDFs of structure A and the averaged RDF is the overlapped area divided by the area under averaged RDF.
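Equations (1) to (3) can be sketched in a few lines of Python. The block below is a minimal illustration under stated assumptions, not the in-house software used in this work: the function names, the common grid shared by all structures, the rectangle-rule integration, and the toy coordinates and polarizability values are placeholders introduced here.

```python
import math

def rdf(coords, valence, polarizability, r_max, step=0.1):
    """Eq. (1): valence-electron-weighted RDF sampled at r = 0, step, ..., r_max."""
    n_points = int(round(r_max / step)) + 1
    g = [0.0] * n_points
    for i in range(len(coords) - 1):
        for j in range(i + 1, len(coords)):
            r_ij = math.dist(coords[i], coords[j])
            a_ij = polarizability[i] + polarizability[j]  # sum of atomic polarizabilities
            width = a_ij ** (-2.0 / 3.0)                  # exponent prefactor a_ij^(-2/3)
            weight = valence[i] * valence[j]              # p_i * p_j
            for k in range(n_points):
                g[k] += weight * math.exp(-width * (r_ij - k * step) ** 2)
    return g

def average_rdf(rdfs):
    """Eq. (2): point-wise mean; assumes every RDF is sampled on the same grid."""
    return [sum(values) / len(rdfs) for values in zip(*rdfs)]

def similarity_index(g_a, g_avg):
    """Eq. (3): overlap area over area under the average RDF.

    Rectangle-rule integration on a uniform grid, so the step cancels in the ratio.
    """
    overlap = sum(min(a, b) for a, b in zip(g_a, g_avg))
    return overlap / sum(g_avg)

# Toy input (hypothetical): three heavy atoms N, C, O in two slightly different geometries.
valence = [5, 4, 6]                   # valence-shell electrons of N, C, O
polarizability = [1.10, 1.76, 0.80]   # placeholder values, for illustration only
structure_a = [(0.0, 0.0, 0.0), (1.45, 0.0, 0.0), (2.1, 1.1, 0.0)]
structure_b = [(0.0, 0.0, 0.0), (1.50, 0.0, 0.0), (2.2, 1.0, 0.0)]

rdfs = [rdf(s, valence, polarizability, r_max=5.0) for s in (structure_a, structure_b)]
g_avg = average_rdf(rdfs)
sigmas = [similarity_index(g, g_avg) for g in rdfs]
```

Because the overlap can never exceed the area under the average curve, each value in `sigmas` lies in the interval (0, 1], and a structure identical to the average yields exactly 1.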

To test the hypothesis that perturbations induced by a ligand bound in the catalytic pocket are local in nature, 109 structures with a ligand in the proximity of Cys145 were selected. For each complex, all residues having at least one atom within 5.12 Å of the ligand were listed. Twenty-five residues, referred to from now on as catalytic pocket residues, fulfilled the distance-based criterion for 7K40, complexed with the ligand boceprevir. To treat all active sites on the same footing, the catalytic pocket residues were extracted from the original pdb files, followed by calculation and analysis of g(r). Ligands were excluded from the calculations to preserve consistency and to simplify comparison of similarity indices. For more details see References [50, 52].
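The distance-based selection of pocket residues can be illustrated as follows. This is a sketch only: the data layout (a list of residue identifier and coordinate pairs), the function name, the use of "less than or equal to" for the cutoff, and the toy coordinates are assumptions made here, not details of the software used in the study.

```python
import math

def pocket_residues(protein_atoms, ligand_coords, cutoff=5.12):
    """Return residues with at least one atom within `cutoff` (in Å) of any ligand atom.

    protein_atoms: iterable of ((residue_name, residue_number), (x, y, z)) pairs.
    ligand_coords: iterable of (x, y, z) tuples for the ligand atoms.
    """
    selected = set()
    for residue, xyz in protein_atoms:
        if residue in selected:
            continue  # one qualifying atom is enough for a residue
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_coords):
            selected.add(residue)
    return sorted(selected, key=lambda res: res[1])  # order by residue number

# Toy example with made-up coordinates (not taken from any PDB entry).
atoms = [
    (("HIS", 41), (0.0, 0.0, 0.0)),
    (("HIS", 41), (1.5, 0.0, 0.0)),
    (("CYS", 145), (4.0, 0.0, 0.0)),
    (("GLN", 306), (30.0, 0.0, 0.0)),
]
ligand = [(2.0, 3.0, 0.0)]
pocket = pocket_residues(atoms, ligand)
```

In this toy case the two residues near the ligand are kept and the distant one is discarded.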

As an additional measure of similarity, the root mean square deviations of Cα atomic positions for all structures were evaluated. To do so, structures were superimposed by creating pairwise sequence alignments first, followed by fitting the aligned residues using the MatchMaker module of Chimera [58], and default parameters. Then, the pairwise root mean square deviations (RMSD) of CA atoms in the protein backbone for all structures were calculated.
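Once structures are superimposed and their residues aligned, the pairwise RMSD itself is a short calculation. The sketch below assumes that the superposition and alignment have already been done (in this work by Chimera MatchMaker) and that each structure is given as an equally long, equally ordered list of Cα coordinates; the function names and the toy coordinates are hypothetical.

```python
import math

def ca_rmsd(coords_a, coords_b):
    """RMSD between two pre-superimposed, residue-aligned lists of Cα coordinates."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

def rmsd_matrix(structures):
    """All-against-all RMSD as a dict keyed by (name_1, name_2); input maps name -> coords."""
    names = list(structures)
    return {(p, q): ca_rmsd(structures[p], structures[q]) for p in names for q in names}

# Toy data: structure "B" is structure "A" displaced by 1 Å along z.
toy = {
    "A": [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)],
    "B": [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0)],
}
matrix = rmsd_matrix(toy)
```

The resulting symmetric matrix, with zeros on the diagonal, is the kind of data a heat map of pairwise RMSD values would display.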

QSAR model design and validation

A list of SARS-CoV 3CLpro inhibitors with experimentally determined IC50 values, constituting our training set (see Table SI1 in the Supporting Information), was obtained from the ChEMBL database [59, 60]. The IC50 values were converted to pIC50, while 3D structures were generated using the GP_global module available at the chemosophia.com site [61]. Geometries were optimized by the MultiGen algorithm for global minimization while conserving the initial stereochemistry [62, 63]. Those molecules were used to reconstruct the molecular field of the model receptor using the 3D-QSAR Cinderella’s Shoe (CiS) algorithm introduced in References [64,65,66]. The molecular field in the CiS method is represented by the Coulomb and van der Waals potentials on the molecular surface of each m-th atom of the ligand molecule with the j-th pseudo-atom of the modeled receptor [67, 68]. Those contributions are calculated using the MERA force field [67, 69]. The performance of the CiS algorithm has been thoroughly tested on various small-molecule datasets and for different kinds of bioactivities, and it proved to be a high-quality classification scheme, with cross-validation quality usually above 0.9 [62, 65, 70,71,72]. A neural network approach was used to model the relationship between bioactivity (pIC50) and CiS descriptors. The computed bioactivity was transformed to the desirability function. The desirability function [65] offers an alternative approach to the drug classification problem, defining the probability of activity as a value between 0 (minimum probability of bioactivity) and 1 (maximum probability of bioactivity). As an external validation of the model, the desirability function was predicted for 38 molecules experimentally tested against the SARS-CoV-2 3CLpro target (see Table SI2 in the Supporting Information for the list of the molecules and their desirability function values). Based on the analysis of the confusion matrix, the threshold value of the desirability function discriminating active from inactive compounds was determined. Technical details about QSAR model design and validation are summarized in Table 1. For specific implementation details see References [67, 68].
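For readers unfamiliar with the pIC50 scale, the conversion mentioned above is the negative decimal logarithm of the IC50 expressed in mol/L. A minimal sketch follows, assuming IC50 values are supplied in μM; the function name is introduced here for illustration.

```python
import math

def pic50_from_micromolar(ic50_um):
    """pIC50 = -log10(IC50 in mol/L); for a value given in μM this is 6 - log10(IC50)."""
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    return 6.0 - math.log10(ic50_um)

# An IC50 of 5 μM corresponds to a pIC50 of about 5.30, and 20 nM (0.02 μM) to about 7.70.
examples = {ic50: pic50_from_micromolar(ic50) for ic50 in (5.0, 0.02)}
```

A higher pIC50 therefore means a more potent inhibitor, which makes the quantity convenient as a regression target.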

Table 1 Parameters and validation of QSAR models for a prognosis of anti-SARS-CoV bioactivity

Bioactivity prediction

A database of FDA-approved and experimental drugs was obtained from DrugBank (version 5.1.7) [53, 54]. All mixtures, charged species, and compounds containing metals were excluded, and 6407 molecules remained in the final database, constituting the prediction set. When the 3D structure of a drug was not part of the sdf file downloaded from DrugBank, the 3D molecular structure was generated with RDKit [74]. The MM3 molecular mechanics force field was used for geometry optimization and global minimum search [75], with special attention paid to avoiding inversion of chiral centers. For the optimized structures, activity against SARS-CoV-2 3CLpro was predicted using our newly developed model and transformed to the desirability function.

Molecular docking

The structure of the SARS-CoV-2 3CLpro was taken from Reference [26]. It was extracted from a 900 ns molecular dynamics simulation as a representative structure of the dominant conformation, with a population above 86%. A standard protocol for target preparation was followed: Gasteiger charges were added to each atom and nonpolar hydrogens were merged. After atom type determination, the structure was saved as a pdbqt file using Chimera [76]. AutoDockTools 4 [77] was used to prepare the fifteen FDA-approved drugs with the highest predicted activity against SARS-CoV-2 3CLpro for docking. The center of the grid box was placed at the position of the Cys145 CA atom, with Cartesian coordinates 13.3, 58.2, and 45.4, and the size of the box was 20 × 25 × 25 Å³. Exhaustiveness and the number of modes were both set to 100. All ligand poses within 4 kcal mol⁻¹ of the best-scoring pose were saved, and after visual inspection of their plausibility, the conformation with the lowest binding energy was kept. Docking experiments were performed using the AutoDock Vina [78] software. The appropriateness of our approach was validated in our previous study [27].
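The docking settings above map naturally onto an AutoDock Vina configuration file. The helper below is a sketch that only assembles such a file as text. The file names are hypothetical, and the assignment of the 20 × 25 × 25 Å box dimensions to the x, y, and z axes in that order is an assumption, since the text does not state it.

```python
def vina_config(receptor, ligand, out, center, size,
                exhaustiveness=100, num_modes=100, energy_range=4):
    """Build the text of an AutoDock Vina configuration file from the stated settings."""
    lines = [
        f"receptor = {receptor}",
        f"ligand = {ligand}",
        f"out = {out}",
        f"center_x = {center[0]}",
        f"center_y = {center[1]}",
        f"center_z = {center[2]}",
        f"size_x = {size[0]}",
        f"size_y = {size[1]}",
        f"size_z = {size[2]}",
        f"exhaustiveness = {exhaustiveness}",
        f"num_modes = {num_modes}",
        f"energy_range = {energy_range}",  # kcal/mol window relative to the best pose
    ]
    return "\n".join(lines) + "\n"

config_text = vina_config(
    receptor="3clpro_dominant.pdbqt",   # hypothetical file name
    ligand="ligand.pdbqt",              # hypothetical file name
    out="ligand_docked.pdbqt",          # hypothetical file name
    center=(13.3, 58.2, 45.4),          # Cys145 CA position given in the text
    size=(20, 25, 25),                  # axis order assumed
)
```

Writing `config_text` to a file and passing it to Vina with its `--config` option would be one way to reproduce a run with these parameters.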

Results and discussion

Structural analysis

All SARS-CoV-2 3CLpro structures in our dataset share the same primary structure. In 130 out of 159 structures, the positions of two residues have not been resolved. Those missing residues are Ser1 and Gln306 in two cases, and Phe305 and Gln306 in the rest. The role of the hydrogen atom and hydrogen bonds in the chemistry of life should not be underestimated [79,80,81]. However, since a hydrogen atom has only one electron, it is extremely hard to determine its position accurately using X-ray crystallography. To avoid introducing additional errors into the experimental structure by modelling protonation states and the sites of protonation of residues’ side chains, g(r) was calculated only for non-hydrogen atoms [51]. The properties of g(r), namely that it unambiguously describes the 3D arrangement of the atoms and that it is invariant to translation and rotation of a molecule, enable us to draw some conclusions about the proteases’ structures just by comparing their g(r). As can be seen from Fig. 2, which displays g(r) of all 3CLpro structures in our dataset, the RDF curves are very similar and share the same features. For example, two sharp spikes at 1.4 Å and 2.4 Å are followed by two less pronounced spikes at 3.8 Å and 4.9 Å. The fine structure is lost for distances above 5 Å. The global maximum of the function lies in the 20.4 Å to 22.4 Å range, and the difference in the maximum value of g(r) is below 4%. The analysis of standard deviations showed that the g(r) of the proteases differs the most at 20.5 Å (Fig. 2, right). The standard deviation at that distance is 4386.5, which is only 0.71% of the maximal value at that point.

Fig. 2 Radial distribution function weighted by the number of valence shell electrons, g(r), for a series of SARS-CoV-2 3CLpro (left), and its standard deviations (right)

The maximum of the first spike, at 1.4 Å, can be interpreted as the mean interatomic distance between two neighboring atoms, while the spike at 2.4 Å is the mean distance between two atoms separated by two bonds. Since only non-hydrogen atoms are considered, only carbon–carbon, carbon–oxygen, carbon–nitrogen, and carbon–sulphur interactions contribute to the total g(r). The similarity index, σ, varies between 0.9808 and 0.9998. Bearing in mind that there are no mutant proteases in our dataset, the differences in structure can be explained by different experimental conditions or by structural rearrangement due to the perturbation introduced by bound ligands. In addition, both covalent and non-covalent inhibitors are present in our dataset. The high similarity between the g(r) curves indicates that the structural changes affect the tertiary structure of the enzyme, most probably through reorientation of flexible loops and/or side chains.

Based on the analysis of the residues having at least one atom closer than 5.12 Å to a ligand, 25 catalytic pocket residues were identified (Thr25, Thr26, Leu27, His41, Met49, Tyr54, Phe140, Leu141, Asn142, Gly143, Ser144, Cys145, His163, His164, Met165, Glu166, Leu167, Pro168, Val186, Asp187, Arg188, Gln189, Thr190, Ala191, Gln192) (Fig. 3). The 7K40 structure was used as a template, with boceprevir being the inhibitor with the most close contacts. On the other hand, complexes 5RHB and 5RHC, containing small methanimine derivatives, have only nine residues fulfilling the 5.12 Å criterion. To treat all structures on the same footing, our analysis of the active pocket includes all 25 catalytic pocket residues for all structures, and to reduce ‘noise’, the inhibitors were excluded. The g(r) of the catalytic pocket shares similar features at short distances with the g(r) of the whole proteases, with resolved maxima at 1.4 Å, 2.4 Å, 3.7 Å, and 4.8 Å. The similarity index, σ, is more spread out, ranging between 0.9485 and 0.9919. While the mean σ for the catalytic pocket is 0.9860 ± 0.0060, σ equals 0.9971 ± 0.0025 for the whole enzyme for the same data set. A two-sided Student’s t test showed that the difference is statistically significant (t = −17.9, p = 1 × 10⁻⁴⁴). This finding corroborates our assumption that, although large-amplitude motions of domain III influence the geometry of the active site, perturbations introduced by bound ligands (even covalently bound ones) are local. One can see from the inset in Fig. 3 that the g(r) of the active pocket has the largest standard deviation at 8.8 Å. At this distance, we identified the atoms with dominant contributions to g(r = 8.8 Å). A dominant contribution is defined as a g(r) value larger than the mean g(r) contribution plus two standard deviations for a specific distance r [50]. The peptide-bond nitrogen atoms of Glu166 and Met165 and the peptide-bond oxygen atom of Val186 are the three atoms with a dominant contribution in 101, 99, and 94 complexes out of 109, respectively. While Met165 and Glu166 are part of the β sheet in domain II, Val186 is part of an unstructured loop connecting domains II and III. In the 7K40 complex of 3CLpro with boceprevir, the nitrogen of Glu166 forms a hydrogen bond with the carbonyl oxygen of boceprevir, and the distance between the two atoms is 2.95 Å. The nitrogen of Met165 and the oxygen of Val186 are not in direct contact with the ligand, but those residues are neighbors through space and influence the depth of the active site. It is interesting to note the large difference in how often atoms of the two catalytic dyad residues appear as atoms with a dominant contribution. While the NE2 atom of His41 is dominant in 89 structures, the S atom of Cys145 has a dominant contribution in only 11 complexes. Covalent inhibitors bind to the S atom of Cys145, restricting its flexibility. At the same time, His41 has to adapt to bound ligands, and reorientation of the imidazole ring is the way to optimize interaction patterns.
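The "mean plus two standard deviations" criterion for dominant atoms can be sketched as below. Several choices here are assumptions made for illustration and are not specified in the text: each pair term of Eq. (1) is credited in full to both atoms of the pair, the population standard deviation is used, and the function names and toy numbers are hypothetical.

```python
import math
import statistics

def atom_contributions(coords, valence, polarizability, r):
    """Per-atom share of g(r) at a single distance r, from the pair terms of Eq. (1)."""
    contrib = [0.0] * len(coords)
    for i in range(len(coords) - 1):
        for j in range(i + 1, len(coords)):
            a_ij = polarizability[i] + polarizability[j]
            r_ij = math.dist(coords[i], coords[j])
            term = valence[i] * valence[j] * math.exp(-(a_ij ** (-2.0 / 3.0)) * (r_ij - r) ** 2)
            contrib[i] += term  # assumption: the pair term is credited to both atoms
            contrib[j] += term
    return contrib

def dominant_atoms(contrib):
    """Indices of atoms whose contribution exceeds mean + 2 * standard deviation."""
    threshold = statistics.fmean(contrib) + 2.0 * statistics.pstdev(contrib)
    return [index for index, value in enumerate(contrib) if value > threshold]

# Toy check with made-up contributions: nine similar atoms and one outlier.
toy_contrib = [1.0] * 9 + [10.0]
toy_dominant = dominant_atoms(toy_contrib)

# Two atoms exactly 8.8 Å apart contribute equally at r = 8.8 Å.
pair = atom_contributions([(0.0, 0.0, 0.0), (8.8, 0.0, 0.0)], [5, 6], [1.10, 0.80], r=8.8)
```

Counting, over all complexes, how often a given atom index appears in the dominant list gives occurrence numbers of the kind reported above.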

Fig. 3 The catalytic pocket of 7K40, with inhibitor boceprevir (left). Radial distribution function weighted by the number of valence shell electrons, g(r), for a series of SARS-CoV-2 3CLpro catalytic pocket (right), and its standard deviations (inset)

Well-established procedures for gaining insight into structural differences include the widely accepted protein overlay and root mean square deviation calculations. The pairwise RMSD of the Cα atoms in the backbone was calculated. The resulting heat map of RMSD values is presented in Fig. 4. Although side chains are neglected here and the analysis was performed only for the enzyme’s backbone, some valuable conclusions can be drawn. Only five structures have RMSD values above 1.0 Å: 6LZE, 6M0K, 6W79, 7BUY, and 7JU7. 6M0K and 5RF9 differ the most, with an RMSD value of 1.78 Å. When those two structures are overlaid, one can see that the most significant difference is at the C-terminus. The last few residues, starting from Cys300, show the greatest flexibility. In 5RF9, they are oriented toward domain II, while in 6M0K (and in 6LZE, 6W79, 7BUY, and 7JU7) they point to the side of domain III. Additionally, this position enables interaction between the C- and N-termini, with the distance between the CA atoms of Ser1 and Val303 being below 6.4 Å. The reason for the higher flexibility might be that these residues do not participate in secondary structure formation and, as terminal residues, their motion is restricted from one side only. Both the N-finger and the residues around the last helix at the C-terminus are known to play an important role in enzyme dimerization [82, 83].

Fig. 4 Structural analysis of SARS-CoV-2 3CLpro. Pairwise RMSD presented as a heat map (left). An overlay of two structures with the highest RMSD value (right) highlighting the area with the biggest structural difference (yellow rectangle). 5RF9 (blue), 6M0K (red)

Bioactivity prediction and docking

Reliable models with high predictive power are ‘must have’ tools for successful drug repurposing. Tenfold cross-validation of our model was performed as an internal validation method. The cross-Q² equals 0.91, indicating the model’s robustness and high predictive ability. Recently, Mody et al. [73] performed an in vitro enzymatic inhibition assay, testing the activity of selected drugs (including viral protease inhibitors, viral non-protease inhibitors, and off-target drugs) against 3CLpro. We used those results as an external set to validate our model. A desirability function value of 0.82 was identified as the threshold for binary classification of compounds. Compounds are classified as inactive when the predicted desirability function value is lower than 0.82, and as active when it is equal or higher. The threshold was obtained by analyzing the confusion matrix and the statistical parameters derived from it, such as the accuracy and the Matthews correlation coefficient (MCC). The elements of the confusion matrix, true positive (TP), false positive (FP), true negative (TN), and false negative (FN), were used to calculate both the MCC \(\left( {\frac{{{\text{TP}} \cdot {\text{TN}} - {\text{FP}} \cdot {\text{FN}}}}{{\sqrt {\left( {{\text{TP}} + {\text{FP}}} \right) \cdot \left( {{\text{TP}} + {\text{FN}}} \right) \cdot \left( {{\text{TN}} + {\text{FP}}} \right) \cdot \left( {{\text{TN}} + {\text{FN}}} \right)} }}} \right)\) and the accuracy \(\left( {\frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{FP}} + {\text{TN}} + {\text{FN}}}}} \right)\). For a desirability function threshold of 0.82, the accuracy and the MCC are 0.84 and 0.41, respectively. It is important to point out that, according to Mody et al. [73], lopinavir is classified as non-active. If lopinavir is instead classified as active, according to Zhang et al. [84], the desirability function threshold is 0.87, with improved accuracy and MCC values of 0.89 and 0.62, respectively. Ivermectin, tipranavir, and paritaprevir, with experimental IC50 values of 21.5 μM, 27.7 μM, and 73.4 μM [73], respectively, are also predicted by the model to be active against the 3CLpro enzyme.
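The threshold-based classification and the two statistics defined above can be written compactly. The sketch below uses made-up desirability values and activity labels purely to exercise the formulas; it does not reproduce the 38-compound external set, and the function names are introduced here.

```python
import math

def confusion(scores, labels, threshold):
    """Return (TP, FP, TN, FN); a compound is predicted active if score >= threshold."""
    tp = fp = tn = fn = 0
    for score, is_active in zip(scores, labels):
        predicted_active = score >= threshold
        if predicted_active and is_active:
            tp += 1
        elif predicted_active and not is_active:
            fp += 1
        elif not predicted_active and not is_active:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator vanishes."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

# Hypothetical desirability values and experimental labels (True = active).
scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.40]
labels = [True, True, False, True, False, False]
counts = confusion(scores, labels, threshold=0.82)
toy_accuracy = accuracy(*counts)
toy_mcc = mcc(*counts)
```

Repeating the calculation over a range of candidate thresholds and keeping the one with the best accuracy and MCC is one plausible way to arrive at a cutoff such as 0.82.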

Since our model demonstrated predictive potential during validation, we predicted the activity of the 6407 molecules from the DrugBank database against SARS-CoV-2 3CLpro. All results are compiled in Table SI3 in the Supporting Information. The 17 molecules with the highest activity are listed in Table 2. Toremifene, a nonsteroidal selective estrogen receptor modulator used in the treatment of advanced breast cancer, has the highest predicted activity against 3CLpro. According to ClinicalTrials.gov (identifier NCT04531748), a randomized, double-blind, controlled clinical trial is evaluating the effects of toremifene in adults with mild COVID-19 [85]. Martin and Cheng suggested that toremifene acts as a potential blocker of the spike glycoprotein and as an inhibitor of the methyltransferase nonstructural protein 14 (NSP14) [86]. However, a 500 ns molecular dynamics simulation of 3CLpro complexed with toremifene showed that toremifene leaves the binding pocket after 284 ns.

Table 2 Top 17 compounds from DrugBank with the highest predicted bioactivity against SARS-CoV-2 3CLpro, together with their 2D structures and docking score (in kcal mol⁻¹)

The mechanism of cleavage of polyproteins by 3CLpro has been investigated for the SARS-CoV virus [18, 87]. Since the initial step includes deprotonation of the Cys145 thiol and nucleophilic attack of the anionic sulfur on the carbonyl carbon atom, a variety of peptidomimetics and small-molecule covalent inhibitors have been proposed [88]. Independent of its nature (covalent or non-covalent, targeting the active pocket), a potential inhibitor has to interact with the target. Being aware of the drawbacks of docking experiments [89,90,91], we docked the hit molecules against SARS-CoV-2 3CLpro to get a general idea of the interactions between a potential inhibitor and the target within the catalytic pocket. A 900 ns molecular dynamics simulation of free, unbound 3CLpro performed by Novak et al. identified two dominant conformations of the enzyme, with predicted populations of 86.7% and 13.3% [26]. Although the main structural difference is the large-amplitude motion of domain III, it affects the geometry of the catalytic pocket, which is wider in the dominant conformer. Succinamide-CoA (DB03905) is a rather large compound, classified as experimental, with a molar mass of 850.6 g mol⁻¹. Because of its flexibility and numerous functional groups, it is capable of forming a variety of interactions with the residues forming the pocket (Fig. 5). For example, it forms hydrogen bonds with the Thr26, Asn28, Asn119, Phe140, Asn142, Gly143, Cys145, Hie163, and Hie164 residues. When bound, it completely blocks access to the catalytic dyad residues, His41 and Cys145 (yellow in Fig. 5, left). 2D interaction networks between all hit molecules and 3CLpro are presented in Figure SI1 in the Supporting Information. In most cases, hydrogen bonds and π-sulphur, π-alkyl, π–π, and van der Waals interactions are responsible for enzyme–ligand binding. More accurate calculations, such as molecular dynamics simulations and binding free energy calculations, are needed to provide a detailed description of the complex interaction patterns.

Fig. 5 Molecular docking results: insight into the catalytic pocket (left) and interaction network (right) of Succinamide-CoA (DB03905) and SARS-CoV-2 3CLpro. Van der Waals surface of His41 and Cys145 is depicted in yellow, all other residues are in blue

Several drug repurposing studies suggested well-known antiviral drugs as potential SARS-CoV-2 3CLpro inhibitors. According to DrugBank [54], more than 70 compounds within our dataset are registered direct antiviral agents. Therefore, we were interested in the performance of our model and its ability to point out promising antiviral drugs as 3CLpro inhibitors. Our predictive model identified eight currently used antiviral drugs as active against SARS-CoV-2 3CLpro (Table 3). In this section, we put our results into a broader perspective and examine how the proposed molecules perform in in vitro, in vivo, and in silico studies. Among these antiviral drugs, lopinavir, a well-known HIV-1 protease inhibitor, is predicted to have the highest activity against 3CLpro. It has been used to treat SARS and MERS patients, but without proven efficacy [4]. Although lopinavir and its analogues have been the subject of many studies [92,93,94,95], in clinical trials the lopinavir–ritonavir combination has not been proven effective for the treatment of severe cases of COVID-19 [30]. Zhang et al. [84] demonstrated the inhibition potential of lopinavir against SARS-CoV-2 3CLpro in vitro and, by extrapolation to in vivo conditions, concluded that lopinavir is not effective against SARS-CoV-2 in vivo because of the very low concentration of free lopinavir, unbound to plasma proteins. Paritaprevir [31, 32, 73, 96], favipiravir [34, 92], atazanavir [97,98,99], ganciclovir [95, 97], tipranavir [73, 98, 100,101,102], and bictegravir [32] were part of computational studies trying to repurpose existing drugs against COVID-19. Paritaprevir is a compound containing an acylsulfonamide moiety and is used in the treatment of hepatitis C. It inhibits the viral NS3/4A serine protease, in which Ser139, His57, and Asp81 constitute the catalytic triad [103, 104]. Favipiravir, a pyrazinecarboxamide derivative, is a broad-spectrum inhibitor of RNA viral replication, currently registered for influenza treatment [105, 106]. Clinical trials indicate its potential use in moderately to critically ill COVID-19 patients [107, 108]. Although computational studies of atazanavir and tipranavir, both HIV-1 protease inhibitors, suggested they might be good 3CLpro inhibitors, careful analysis of their efficacy in cell culture and in vitro enzymatic assays revealed limited potential, because high concentrations of the drugs are required to achieve significant inhibition [98]. The results of those studies show that the mentioned antiviral drugs have potential in the fight against the COVID-19 pandemic, and at the same time they serve as an independent validation of our model. From a pool of more than 6400 different molecules, our approach enriched the final list with compounds that were identified as active either by other theoretical methods or by experiments.

Table 3 List of direct antiviral drugs that are predicted to show inhibition activity against SARS-CoV-2 3CLpro

Conclusions

This study had three goals. First, a structural similarity analysis of SARS-CoV-2 main protease structures obtained by X-ray crystallography was performed, based on the radial distribution function weighted by the number of valence shell electrons. Regardless of the different crystallization conditions, different space groups, and different inhibitors bound in the enzyme’s catalytic pocket, the RDF-based similarity index lies between 0.9808 and 0.9998. This suggests that perturbations of 3CLpro introduced by the ligand are local, concentrated in the vicinity of the active pocket. This finding is corroborated by an independent analysis of the RMSD of the backbone CA atoms and by an additional analysis of the g(r) of the catalytic pocket.

The second goal was achieved by the successful design and validation of a QSAR model capable of predicting activity against SARS-CoV-2 3CLpro. After reconstruction of the pseudo-receptor complementary to the external field of bioactive molecules using the CiS algorithm, a neural network was used to train the model. Internal predictive power was tested by tenfold cross-validation, giving a cross-Q² of 0.91. Since a high value of cross-Q² is a necessary but not a sufficient condition for a model’s high predictive ability, additional external validation was performed. The R² value of 0.90 demonstrated the model’s high predictive ability for external molecules.

Finally, the newly developed predictive model was exploited for drug repurposing. From the list of FDA-approved and experimental drugs, we identified the molecules with the highest probability of being SARS-CoV-2 3CLpro inhibitors. Special attention was paid to existing antiviral drugs. Among them, lopinavir, an HIV-1 protease inhibitor, is predicted to have the highest potential to inhibit SARS-CoV-2 3CLpro. Although in vitro experiments confirm that it inhibits SARS-CoV-2 3CLpro, its effectiveness in the treatment of severe COVID-19 cases is questionable because of the very low concentration of free lopinavir, unbound to plasma proteins. Other antiviral agents identified by our model as promising, such as paritaprevir, are also under investigation by other groups or have already reached clinical trials. These independent results support the good performance of the model. In the short term, this research offers fast drug repurposing possibilities; in the long term, it provides a developed and validated model for reliable prediction of bioactivity against 3CLpro.