Introduction

Drug discovery is one of the great challenges in molecular modeling, whose approaches aim to identify lead compounds. Chemical space is typically estimated to contain about 10^60 molecules, and screening these molecules experimentally to identify lead molecules is a mammoth task. Virtual screening applies a series of filters to identify potential lead compounds from a huge pool of compounds [1,2,3]. Many commercial and open-source software packages are available to reduce the time and cost of drug discovery. Developing open-source drug discovery software together with disease-specific information, a robust compound library, and a fragment library is essential for identifying potential lead compounds with improved activity. One such effort from our group is the open-source computational drug discovery software MPDS [4, 5].

Fragment-based drug design (FBDD) is an effective method for the quick and precise identification of chemical moieties from which selective, high-affinity lead molecules can be designed [6, 7]. The outcome of FBDD is directly influenced by the composition of the fragment library being used [8, 9]. Several methods are available for fragmenting molecules, and they fall into four broad categories according to the pattern of fragmentation: (a) hierarchical and systematic, (b) retrosynthetic, (c) knowledge-based, and (d) random fragmentation. All four use molecular connectivity to generate fragments. Systematic graph fragmentation and computer-generated fragmentation based on retrosynthetic rules are the two most widely used methods for constructing fragment libraries in FBDD. Graph fragmentation cleaves molecules at topologically defined positions, such as the bond between a ring and its substituents, whereas computer-generated retrosynthesis uses predefined rules based on chemical reactions and cleaves bonds such as amide bonds. Bemis and Murcko examined the features of drug molecules in terms of rings, linkers, frameworks, and side-chain atoms, as well as atom type, hybridization, and bond order [10, 11]. They emphasized the most common frameworks that represent the structural variation in the drug molecule dataset. According to Murcko, (i) ring systems are cycles and groups of cycles that share edges, (ii) linkers are atoms that connect two ring systems, and (iii) substituents are atoms that belong to neither rings nor linkers. Rules based on physicochemical properties are also widely used to construct fragment libraries. Congreve et al. [12] proposed the “rule of three” to define the physicochemical properties of a fragment, viz. (a) molecular weight (MW) ≤ 300 Da; (b) hydrogen bond donors (HBD) and hydrogen bond acceptors (HBA) each ≤ 3; (c) LogP ≤ 3.42.
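
The rule-of-three criteria listed above can be written as a simple filter. The following is a minimal sketch, not the authors' implementation: the `FragmentProps` container, the function name, and the property values of the two example fragments are hypothetical, and the thresholds are taken as stated in the text.

```python
from dataclasses import dataclass


@dataclass
class FragmentProps:
    """Hypothetical container for precomputed fragment properties."""
    mol_weight: float  # molecular weight in Da
    hbd: int           # number of hydrogen bond donors
    hba: int           # number of hydrogen bond acceptors
    logp: float        # octanol-water partition coefficient


def passes_rule_of_three(p, max_mw=300.0, max_hbd=3, max_hba=3, max_logp=3.42):
    """Return True if every property is within its cutoff.

    The default cutoffs are the values quoted in the text; they are
    exposed as parameters so that other thresholds can be tried.
    """
    return (p.mol_weight <= max_mw and p.hbd <= max_hbd
            and p.hba <= max_hba and p.logp <= max_logp)


# Illustrative (not measured) property values for two fragments.
small_ring = FragmentProps(mol_weight=79.1, hbd=0, hba=1, logp=0.65)
large_ring = FragmentProps(mol_weight=412.5, hbd=2, hba=6, logp=4.1)

kept = [p for p in (small_ring, large_ring) if passes_rule_of_three(p)]
```

In practice the property values would come from a cheminformatics toolkit rather than being typed in by hand.
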
Several molecular fragmentation techniques have been developed to obtain synthetically feasible chemical motifs or fragments (Table S1). Among these, RECAP and BRICS are retrosynthetic fragmentation methods. In RECAP, molecules are fragmented according to eleven defined bond categories, while the rings that are key structural moieties of the molecules are kept intact. BRICS extends the RECAP rules by considering the environment of every bond type and its surrounding substructures [13, 14]. In addition, other fragmentation methods and unexplored fragment spaces remain to be investigated [15,16,17]. Many fragmentation techniques for FBDD have also been developed that consider binding-site and fragment-connection information from a macromolecule-ligand complex [18,19,20,21,22,23,24,25]. Our group has carried out a series of fragment- and structure-based studies using traditional computer-aided drug design, artificial intelligence, and machine learning approaches to probe the molecular or structural properties of small molecules, macromolecules, and other complexes [26,27,28,29,30,31,32,33]. The applicability of a fragmentation method depends on the specific purpose for which it is used. Several approaches have been developed for constructing fragment libraries, and most commercial fragment libraries were found to obey the widely accepted “rule of three.” Over the years, various computer programs have also been developed to enumerate molecules, and databases on the order of 10^20 molecules have been created and are considered ultra-large chemical repositories [34, 35]. Our approach identifies several types of fragments through a systematic fragmentation method that breaks a given molecule into rings, linkers, and substituents in order to understand the diversity of chemical space.
We explored vast chemical libraries of known and bioactive molecules, namely ChEMBL [36], DrugCentral [37], PDB ligands [38], and COCONUT [39], together with molecules from the SAVI database [40]. Most of the fragmentation tools mentioned above are not publicly available as web services and thus cannot be used by many computational drug designers. Elead3D and ACFIS, which screen inbuilt fragment libraries against targets for FBDD, are among the few available computational drug design web services [41, 42]. While virtual screening facilities are available for fragments, no web server offers fragment library construction tools, inbuilt fragment libraries, and screening facilities together on a single platform. This impelled us to develop the FBDD module and integrate it into the Molecular Property Diagnostic Suite (MPDS) web portal [3,4,5].

Materials and methods

Data curation and fragment generation

The bioactive molecules from ChEMBL (2,105,464), DrugCentral (4099), and PDB ligands (36,209), a set of natural products from COCONUT (406,747), and theoretically generated molecules from the SAVI database (~1 billion) were retrieved. The database identifiers of all these molecules were retained, and the molecules from each database were fragmented separately into rings, linkers, and substituents using an in-house Python script that calls the “FragmentOnBonds” function in RDKit. As the ring systems obtained from fragmentation are rigid entities that form the scaffold of a compound, all further analyses focused only on rings. The fragmentation algorithm in RDKit cleaves the bonds between atoms that are part of a ring and atoms that are not, which generates small fragments [43]. We filtered the fragments using the rule of three [12], reducing the set to a manageable number for detailed analysis. All the ring fragments were further classified into structural categories: monocyclic, bicyclic, tricyclic, tetracyclic, and polycyclic. The complete procedure used in this study is depicted in Scheme 1.
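
The cleavage rule described above (cut every bond joining a ring atom to a non-ring atom) can be illustrated on a plain atom/bond graph. This is a toy sketch in pure Python, not the in-house script, which relies on RDKit; the function names and the example molecule graph are hypothetical. A bond is treated as a ring bond when its two atoms remain connected after the bond itself is ignored.

```python
from collections import deque


def ring_atoms(n_atoms, bonds):
    """Return the set of atoms that sit on at least one cycle."""
    adj = {i: set() for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)

    def on_cycle(a, b):
        # Breadth-first search from a to b that ignores the direct a-b bond.
        seen, queue = {a}, deque([a])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt in seen or {cur, nxt} == {a, b}:
                    continue
                if nxt == b:
                    return True
                seen.add(nxt)
                queue.append(nxt)
        return False

    atoms = set()
    for a, b in bonds:
        if on_cycle(a, b):
            atoms.update((a, b))
    return atoms


def split_rings(n_atoms, bonds):
    """Cut ring/non-ring bonds; return (ring systems, other fragments)."""
    rings = ring_atoms(n_atoms, bonds)
    # Keep a bond only if both ends are ring atoms or both are non-ring atoms.
    # Note: two rings joined by a direct bond are therefore not separated.
    kept = [(a, b) for a, b in bonds if (a in rings) == (b in rings)]
    adj = {i: set() for i in range(n_atoms)}
    for a, b in kept:
        adj[a].add(b)
        adj[b].add(a)
    ring_systems, others, seen = [], [], set()
    for start in range(n_atoms):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            cur = stack.pop()
            if cur not in comp:
                comp.add(cur)
                stack.extend(adj[cur] - comp)
        seen |= comp
        (ring_systems if start in rings else others).append(sorted(comp))
    return ring_systems, others


# Two six-membered rings joined through linker atom 6, plus substituent atom 13.
ring_a = [(i, (i + 1) % 6) for i in range(6)]
ring_b = [(7 + i, 7 + (i + 1) % 6) for i in range(6)]
bonds = ring_a + ring_b + [(0, 6), (6, 7), (3, 13)]
ring_systems, others = split_rings(14, bonds)
```

Here `ring_systems` holds the two rings and `others` holds the linker atom and the substituent atom as separate fragments.
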

Scheme 1 The integrative methodology used for the generation of the fragment library

De-duplication of ring fragments and fingerprint generation

After fragmentation, the SMILES of the ring fragments contained patterns such as “[*1]” or “[*10]” that mark the position where a bond was cleaved. These SMILES were first cleaned to remove all such patterns, and the InChIKey of each cleaned fragment was then computed using OpenBabel3 [44]. A two-step redundancy removal was carried out with an in-house Python script, first at the database level and then within the structural category groups. As an InChIKey is considered unique to a molecule, it was used to de-duplicate the redundant ring fragments, while the database IDs of all duplicate rings were retained. The database IDs were used to generate a 5-bit fingerprint in which each bit corresponds to one of the five databases: ‘1’ indicates that the ring fragment is present in that database and ‘0’ that it is absent.
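
The clean-up, de-duplication, and 5-bit fingerprint steps can be sketched as follows. This is an illustration under stated assumptions rather than the in-house script: the cleaned SMILES string stands in for the InChIKey as the de-duplication key, the regular expression only handles attachment points joined by single bonds, the database identifiers are made up, and the bit order follows the order given in the Results section (ChEMBL, COCONUT, DrugCentral, PDB Ligands, SAVI).

```python
import re

DATABASES = ("ChEMBL", "COCONUT", "DrugCentral", "PDB Ligands", "SAVI")

# Matches attachment-point markers such as [*1], [*10], [1*] or [*],
# with or without the parentheses of a branch around them.
_DUMMY = re.compile(r"\(\[\d*\*\d*\]\)|\[\d*\*\d*\]")


def strip_attachment_points(smiles):
    """Remove cleavage markers from a fragment SMILES (single bonds only)."""
    return _DUMMY.sub("", smiles)


def build_fingerprints(records):
    """Merge (key, database, db_id) records into one entry per key.

    Returns {key: (bitstring, [(database, db_id), ...])}; every source
    identifier is retained while duplicates collapse onto one key.
    """
    merged = {}
    for key, database, db_id in records:
        entry = merged.setdefault(key, {"bits": [0] * len(DATABASES), "ids": []})
        entry["bits"][DATABASES.index(database)] = 1
        entry["ids"].append((database, db_id))
    return {k: ("".join(map(str, v["bits"])), v["ids"]) for k, v in merged.items()}


# Hypothetical records; the identifiers are placeholders, not real entries.
records = [
    (strip_attachment_points("[*1]c1ccccc1"), "ChEMBL", "chembl-example-1"),
    (strip_attachment_points("c1ccc([*2])cc1"), "SAVI", "savi-example-7"),
    ("C1CCOC1", "COCONUT", "coconut-example-3"),
]
fingerprints = build_fingerprints(records)
```

A string key like this is only as reliable as the canonicalization behind it, which is why a canonical identifier such as the InChIKey is the better choice for real data.
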

Probing the unique and overlapping ring fragments

The diversity of the ring fragments was further analyzed at both the database and the structural category level using the fingerprints. At the database level, rings found in two or more databases and rings unique to a specific database were identified. At the structural level, biologically active ring systems and ring systems generated through de novo approaches were analyzed. Tanimoto coefficient scores were calculated using the DataWarrior tool [45] and were used extensively to retrieve the most diverse fragments as well as their frequency of occurrence.

Similarity analysis

The unique rings identified after de-duplication were also analyzed on the basis of Morgan fingerprints. Pairwise Tanimoto coefficients were computed to generate a similarity matrix, which was then used to rank the rings by similarity score.
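
For two fingerprints with sets of "on" bits A and B, the Tanimoto coefficient is |A ∩ B| / |A ∪ B|. The sketch below computes the all-by-all matrix and one possible ranking. It is a simplified illustration: small hand-written bit sets stand in for Morgan fingerprints, and ranking by mean similarity to all other rings (lowest first) is an assumption, since the text does not specify the ranking criterion.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two collections of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_matrix(fps):
    """All-by-all Tanimoto scores as {(name_i, name_j): score}."""
    return {(x, y): tanimoto(fps[x], fps[y]) for x in fps for y in fps}


def rank_by_mean_similarity(fps):
    """Names sorted from least to most similar to the rest of the set."""
    matrix = similarity_matrix(fps)
    n_others = max(len(fps) - 1, 1)

    def mean_to_others(name):
        return sum(matrix[(name, other)] for other in fps if other != name) / n_others

    return sorted(fps, key=mean_to_others)


# Hypothetical on-bit sets standing in for Morgan fingerprints.
fps = {
    "ring_a": {1, 2, 3, 4},
    "ring_b": {3, 4, 5, 6},
    "ring_c": {7, 8},
}
ranking = rank_by_mean_similarity(fps)
```

The full matrix grows quadratically with the number of rings, so for a library of this size it would normally be computed in blocks or restricted to a subset.
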

Results and discussion

Generation of the molecular fragments

The molecular dataset was compiled from several public-domain databases, namely ChEMBL, DrugCentral, PDB ligands, COCONUT, and SAVI. Using an in-house Python script, these molecules were fragmented into rings, linkers, and substituents. The cyclic fragments (interconnected ring systems) were classified into monocyclic, bicyclic, tricyclic, tetracyclic, and polycyclic categories, which was necessary for examining the structural diversity of the fragments. A total of 107,614 non-redundant ring fragments were obtained, and a 5-bit fingerprint was generated for each from its database identifiers. The bit positions correspond, in order, to ChEMBL, COCONUT, DrugCentral, PDB Ligands, and SAVI. An example of the fragmentation procedure is shown in Fig. 1.
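
One way to assign a ring system to a structural category is to count its independent rings. The sketch below uses the cyclomatic number of the ring-system graph (bonds minus atoms plus one for a connected graph). Both this counting rule and the threshold of five or more rings for "polycyclic" are assumptions made for illustration; the text does not state how the categories were computed.

```python
CLASS_NAMES = {1: "monocyclic", 2: "bicyclic", 3: "tricyclic", 4: "tetracyclic"}


def ring_count(n_atoms, bonds):
    """Number of independent rings in a connected ring-system graph."""
    return len(bonds) - n_atoms + 1


def classify_ring_system(n_atoms, bonds):
    """Map a ring system to a structural category by its ring count."""
    n_rings = ring_count(n_atoms, bonds)
    if n_rings < 1:
        raise ValueError("graph contains no ring")
    # Assumption: anything with five or more rings is labelled polycyclic.
    return CLASS_NAMES.get(n_rings, "polycyclic")


# A six-membered ring: 6 atoms, 6 bonds.
benzene_like = [(i, (i + 1) % 6) for i in range(6)]

# Two fused six-membered rings: a 10-atom perimeter plus one shared bond.
naphthalene_like = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5)]

labels = (
    classify_ring_system(6, benzene_like),
    classify_ring_system(10, naphthalene_like),
)
```
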

Fig. 1 Process of generation of fragments from molecules into rings, linkers, and substituents

Analysis using the fingerprint

The 5-bit fingerprint was used to map both the fragments unique to one database and those shared by two or more databases. This analysis showed which types of ring fragment were unique and which were spread across the databases. Figure 2 shows a pie chart of the percentage overlap of fragments among the databases; the largest overlap comes from the tricyclic group (62.3%), followed by the bicyclic (20.1%), tetracyclic (10.7%), monocyclic (4.3%), and polycyclic (2.5%) groups. The number of fragments shared between two or more databases is listed in Table S2. A large overlap was observed between the ChEMBL and COCONUT databases (45.53%), followed by the overlap between ChEMBL and SAVI, and so on (Table S2). The rings from the DrugCentral and PDB-Ligand datasets have little or no overlap with the other databases, indicating that the core scaffolds of drugs and drug-like molecules share very little structural similarity. Thus, exploring novel scaffolds from COCONUT, ChEMBL, and SAVI, followed by a detailed analysis of their targets, is a promising direction for small-molecule drug discovery.
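
The unique and shared counts discussed above follow directly from the bit patterns. The following is a minimal sketch of such a tally, assuming fingerprints are stored as 5-character strings in the stated database order; the function name and the example fingerprints are hypothetical.

```python
from collections import Counter
from itertools import combinations

DATABASES = ("ChEMBL", "COCONUT", "DrugCentral", "PDB Ligands", "SAVI")


def overlap_summary(fingerprints):
    """Tally database-unique fragments, shared fragments, and pair overlaps."""
    unique, pairs, shared = Counter(), Counter(), 0
    for bits in fingerprints:
        present = [db for db, bit in zip(DATABASES, bits) if bit == "1"]
        if len(present) == 1:
            unique[present[0]] += 1          # found in exactly one database
        elif len(present) >= 2:
            shared += 1                      # found in two or more databases
            pairs.update(combinations(present, 2))
    return unique, shared, pairs


# Hypothetical fingerprints for five ring fragments.
example = ["10001", "11000", "00001", "11001", "00010"]
unique, shared, pairs = overlap_summary(example)
```

A fragment present in three databases contributes to each of the three database pairs, so pair counts can sum to more than the number of shared fragments.
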

Fig. 2 The percentage of common fragments based on their structural diversity

Analysis of unique fragments

The structurally classified fragments are summarized in Table 1, Table S3, and Fig. 3. The largest number of ring fragments are tricyclic (48,842), followed by tetracyclic (42,944), polycyclic (8069), bicyclic (7195), and monocyclic (564). The database-wise distribution shows that the largest number of unique fragments was obtained from SAVI (74,014), followed by ChEMBL (24,065), COCONUT (8799), PDB ligands (698), and DrugCentral (38). This analysis suggests that the SAVI, ChEMBL, COCONUT, and PDB ligands databases are very important in terms of structural diversity and are key resources for developing a highly diverse fragment library.

Table 1 The population of different types of structurally unique fragments

Fig. 3 Percentage distribution of the ring fragments in the dataset based on (A) type of ring and (B) database

Analysis of structural diversity and frequency of occurrence

The diversity of the non-redundant fragments was investigated by comparing the fragments with each other and calculating the all-by-all similarity matrix using the Tanimoto coefficient. The resulting matrix (107,614 × 107,614) is difficult to analyze and interpret, so we generated a separate matrix for the fused fragments only. The data points of this matrix were then arranged in a 2D plane, grouped by the distribution of Tanimoto scores, as shown in Figure S1. The results showed that ~80% and ~18% of the fragments have Tanimoto scores of nearly 0.1 and 0.2, respectively; that is, ~98% of the fragments are highly dissimilar, suggesting that the generated library of ring fragments (20,225) is very diverse.

In terms of the Tanimoto coefficient, the tricyclic ring fragments form the most diverse set, followed by the tetracyclic, polycyclic, and bicyclic ring fragments. Following these analyses, the diverse fragments were ranked within each structurally classified group using algorithms implemented in the DataWarrior software, and the highly ranked fragments were subjected to a frequency-of-occurrence analysis against the ChEMBL database. Fig. 4 shows the ten most frequent unique fragments from the monocyclic, bicyclic, tricyclic, tetracyclic, and polycyclic groups that are biologically active according to the ChEMBL database. The highest occurrence is observed for the monocyclic ring fragments, followed by the tricyclic, bicyclic, tetracyclic, and polycyclic ring fragments. Among the top ten fragments from each group, the highest occurrence is found in the monocyclic group with 0.684% (14,391), and the lowest in the polycyclic group with 0.003% (59). In all the structural categories, the ten most frequent ring fragments are dominated by hetero-aliphatic rings rather than carbo-aliphatic rings. However, hetero-aliphatic and carbo-aliphatic rings appear to be highly comparable among the fragments with large ring sizes. This is in good accordance with the observation made for GSK medicinal compounds [46]. The dominance of hetero-aliphatic rings may be due to their higher hydrophilicity, which often improves the solubility of chemical compounds [47], and to their ability to form hydrogen bonds and other interactions with biological systems. As hetero-aliphatic rings are an important component of biologically important molecules, we extended our analysis to the count of different heteroatoms such as N, P, O, and S in the ring structure, as shown in Fig. 5 and Table S4.
The results showed that rings containing N, O, or their combination (N + O) are the most frequent (75.37%), followed by rings containing N in combination with S (17.58%). Interestingly, the proportion of carbo-cyclic ring structures (0.79%) is higher than that of hetero-aliphatic rings containing S, P, or O + S, or of rings containing more than two heteroatoms (Fig. 5). These carbo-cyclic ring structures may come mainly from the larger ring-size fragments, where the occurrence of carbo-aliphatic rings appears to be highly comparable to that of hetero-aliphatic rings (Fig. 5). Overall, the results suggest that the diversity of the ring fragments is largely contributed by hetero-aliphatic rings and that N- and O-containing rings are among the most significant structural components.
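
A heteroatom breakdown of this kind can be approximated directly from the fragment SMILES. The sketch below is a rough string-level illustration, not the procedure used for Fig. 5: it extracts element symbols with a regular expression instead of parsing the structure, so it assumes each SMILES describes ring atoms only, and the example SMILES and class labels are chosen for demonstration.

```python
import re
from collections import Counter

# Group 1: element symbol inside a bracket atom, e.g. [nH], [NH+], [Se].
# Group 2: organic-subset atoms written without brackets.
_ATOM = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})[^\]]*\]|(Br|Cl|[BCNOPSFI]|[bcnops])")


def elements(smiles):
    """Set of element symbols in a SMILES string (aromatic atoms capitalized)."""
    return {(m.group(1) or m.group(2)).capitalize() for m in _ATOM.finditer(smiles)}


def hetero_class(smiles):
    """Label such as 'N', 'N+O', or 'carbocyclic' for a ring SMILES."""
    hetero = elements(smiles) - {"C", "H"}
    return "+".join(sorted(hetero)) if hetero else "carbocyclic"


def class_percentages(smiles_list):
    """Percentage of fragments falling into each heteroatom class."""
    counts = Counter(hetero_class(s) for s in smiles_list)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Illustrative ring SMILES: pyridine, morpholine, thiazole, cyclohexane.
rings = ["c1ccncc1", "C1COCCN1", "c1cscn1", "C1CCCCC1"]
distribution = class_percentages(rings)
```

A structure-aware toolkit would be needed to restrict the count to ring atoms when substituents are still attached.
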

Fig. 4 Ten most frequent unique fragments from the monocyclic, bicyclic, tricyclic, tetracyclic, and polycyclic groups against the ChEMBL database. The occurrence of each fragment in the ChEMBL dataset is shown in normal font and the percentage in bold font

Fig. 5 Distribution of unique fragments based on heteroatoms

Incorporation of fragment library module in molecular property diagnostic suite

The Molecular Property Diagnostic Suite (MPDS) is a Galaxy-based open-source computational drug discovery software developed in our group. MPDS has been built for different diseases, namely tuberculosis [3, 4], diabetes [5], COVID-19, etc. The modules in MPDS are classified into disease-dependent and disease-independent modules. The disease-dependent modules are customized to specific diseases and mainly provide gene information, drug information, etc. The disease-independent modules are mainly drug discovery modules such as molecular docking and binding site prediction. The MPDS also has a compound library of million compounds with structural classification.

The identified fragments have been incorporated into the MPDS under the MPDS fragment library section. A total of 107,614 fragments are classified as monocyclic, bicyclic, tricyclic, tetracyclic, or polycyclic. Properties such as molecular weight, number of hydrogen bond donors, number of hydrogen bond acceptors, molar refractivity, number of heavy atoms, number of rotatable bonds, logP, and topological polar surface area (TPSA) have been calculated for each fragment and stored in a MySQL database. Users can select a ring type (i.e., monocyclic, bicyclic, tricyclic, etc.) from the drop-down menu, and the corresponding fragments and their structures are displayed in the MPDS platform.

Conclusions

This work resulted in a highly diverse library of ring fragments whose diversity arises largely from hetero-aliphatic ring structures. The library allows the sampling of large chemical spaces comprising molecules with known biological activity, natural products, and theoretically generated molecules. The information obtained from the analysis of the generated fragment library can help in finding novel drug molecules, which is a critical step in biomedical research. The library module is available in the MPDS web portal (http://mpds.neist.res.in:8085) to assist in molecule design, especially computer-aided drug design. The module provides an inbuilt fragment library database for extracting molecular fragments (classified as rings, linkers, and substituents), an open-source fragmentation tool, and a virtual screening tool. The fragment library in MPDS will be updated at regular intervals with fragments from new chemical entities, offering varied skeletons and scaffolds for designing molecules. We hope that this module will allow the scientific community to use the FBDD approach more effectively in the computational design of molecules.