Virtual screening approaches are widely used computational methods in modern drug discovery projects, where they often replace or help to reduce more expensive and time-consuming high-throughput screening [1]. There are two major categories of screening approaches: ligand-based and structure-based methods [2].

Ligand-based methods are typically used if no X-ray structure of the target receptor is available. A single compound or a set of compounds known to bind to a specific target or to be active in a functional assay is used as the template to identify similar compounds in a large virtual database. In general, similarity can be evaluated on the basis of 2D and 3D molecular representations [3]. The classical 2D chemical similarity representation is based on molecular fingerprints (e.g. circular fingerprints, topological fingerprints, substructure fingerprints), which transform the molecular representation into a bit vector. The similarity between two vectors is then calculated with one of various similarity measures, the most common being the Tanimoto coefficient. 3D similarity methods mainly compare the shapes of two molecules, typically extended by 3D pharmacophoric features; ROCS, for example, is considered the industry-leading commercial program for shape-based screenings [4].

Structure-based approaches, in most cases classical docking methods, are typically preferred if 3D structure information for the target is available [5]. However, 2D ligand-based methods often require only a fraction of a second for a single structure comparison, which makes it possible to perform large screenings within a few hours even on a single, standard CPU. In contrast, docking methods are considerably more resource-demanding and time-consuming, not to mention more elaborate methods such as molecular dynamics simulations [6]. As a consequence, ligand-based methods are very attractive options for initial attempts to identify or filter relevant compounds in large and ultra-large virtual databases [7]. Furthermore, they are valuable tools to identify close analogues of known active compounds in a time-efficient manner. In the last couple of years, several methods have been developed to screen non-enumerated chemical spaces of up to 10^15 compounds and beyond in seconds to minutes on standard hardware [8]. The most elaborate technique for screening such large spaces relies on chemical fragment spaces with corresponding connection rules, e.g. BioSolveIT’s fragment spaces in connection with the FTrees similarity implemented in their infiniSee software allow the screening of huge chemical spaces (e.g. Enamine REAL space) in seconds on standard hardware [9, 10].

There are many open-source web servers available for the screening of enumerated compound libraries using a variety of different structure- and ligand-based methods, recently reviewed by Singh et al. [11]. For example, many well-known databases such as ChEMBL, PubChem or ZINC include ligand-based similarity search functionalities with molecular fingerprints and/or substructure searches [12,13,14]. The web tool SwissSimilarity allows for the 2D fingerprint and 3D shape screening of common public databases and compound libraries of most commercial vendors such as Enamine or ChemDiv [15, 16]. Pharmit additionally offers the possibility to screen large databases based on pharmacophore queries [17].

Several standalone tools focusing on enumerated 2D ligand-based screening approaches are available, most of which are commercial products [8]. Prominent examples are Schrödinger’s GPUSimilarity, integrated in their LiveDesign suite and using a GPU-powered server in the background, NextMove Software’s Arthor with a SMARTS-based pattern matcher, and Andrew Dalke’s chemfp command-line tool [18,19,20].

To the best of our knowledge, there is no open-source command-line tool available which is similar to the SwissSimilarity or Pharmit web servers and which allows for the comprehensive screening of databases and library files using different 2D and 3D ligand-based screening approaches, all combined in one tool.

In the following, we report an open-source command-line tool called “Virtual Screening WorkFlow” (VSFlow), written in Python and containing three different ligand-based screening modes. It relies on the open-source cheminformatics software RDKit [21]. VSFlow includes a substructure-based and a fingerprint-based screening mode (2D) as well as a 3D shape-based screening mode (Fig. 1). Additionally, it provides two tools for preparing and managing compound databases for virtual screening.

Fig. 1 Different screening functionalities of VSFlow


VSFlow is written in Python, is open-source and can be downloaded from the project’s GitHub repository. It is licensed under the MIT license. As a prerequisite, a working installation of Anaconda or Miniconda is needed [22]. VSFlow including all dependencies can then be installed with the provided yml file as follows:

figure b

The Python dependencies are rdkit, xlrd, xlsxwriter, pdfrw, fpdf, pymol-open-source, molvs and matplotlib [23, 24]. VSFlow requires Python version 3.7 or higher.

VSFlow includes five separate tools: preparedb, substructure, fpsim, shape and managedb (Fig. 1). All functionalities of VSFlow can also be run in parallel on multiple cores/threads. Parallelization is implemented via Python’s built-in multiprocessing module.

preparedb: prepare databases

VSFlow contains a tool to prepare compound libraries for virtual screening (preparedb). It allows for standardization of the molecules, generation of fingerprints and generation of multiple conformers (Fig. 2). The output file is a “virtual screening database” (.vsdb) file. The vsdb file is a Python pickle file containing all information in a special Python dictionary format, which significantly enhances loading speed compared to SD files; this is particularly relevant for larger databases. Standardization is done on the basis of the MolVS rules and includes charge neutralization, salt removal and optionally tautomer canonicalization [23]. Fingerprints are generated with the RDKit chemistry framework. Conformers are generated with the RDKit ETKDGv3 method and optimized with the MMFF94 force field [25]. The following options are available:

  • standardize: standardizes molecules, removes salts and associated charges

  • conformers: generates multiple 3D conformers for database molecules

  • canonicalize: adds the canonical tautomer to the database

  • fingerprint: generates the respective fingerprint for each molecule and stores it in the database

It is also possible to directly download the PDB ligands and the ChEMBL database and store them as vsdb databases, e.g.

figure c

The above command will download all PDB ligands, standardize the molecules (-s argument), calculate the ECFP2 fingerprint (-f and -r arguments) for every molecule and store it along with the molecule in the database (-o argument). The same can be done for the ChEMBL database, e.g. with a different fingerprint:

figure d

Fig. 2 Preparedb functionality of VSFlow: prepare compound libraries for virtual screening
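
To illustrate why a pickled dictionary can be restored quickly, the following minimal sketch writes a few prepared records to disk and reads them back with a single `pickle.load` call. The record layout (keys `smiles` and `fp_bits`), the helper names `save_db` and `load_db`, the bit indices and the file name are assumptions made for this illustration only; the actual layout of a .vsdb file is not described here and may differ.

```python
import os
import pickle
import tempfile


def save_db(records, path):
    """Write the whole record dictionary to disk as a single pickle."""
    with open(path, "wb") as handle:
        pickle.dump(records, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_db(path):
    """Restore the record dictionary in one call, without per-molecule parsing."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical layout: one entry per molecule, keyed by a running index.
# fp_bits holds the indices of the set fingerprint bits (made-up values).
records = {
    0: {"smiles": "c1ccccc1", "fp_bits": frozenset({12, 87, 1510})},
    1: {"smiles": "c1cscn1", "fp_bits": frozenset({12, 233, 1999})},
}

path = os.path.join(tempfile.mkdtemp(), "example.vsdb")
save_db(records, path)
restored = load_db(path)
print(len(restored), "records restored")
```

Because unpickling can execute arbitrary code, pickle-based database files should only be loaded from trusted sources.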

substructure: substructure search

The substructure search (substructure) is performed based on the GetSubstructMatches() functionality available for RDKit Mol objects.
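
The underlying RDKit call can be sketched as follows. This is a simplified illustration and not the VSFlow implementation: the helper name `count_matches` and the three-molecule library are hypothetical, and the thiazole SMARTS is written out for this example.

```python
from rdkit import Chem


def count_matches(smiles_list, smarts):
    """Return {smiles: number of substructure matches} for all matching molecules."""
    pattern = Chem.MolFromSmarts(smarts)
    if pattern is None:
        raise ValueError(f"invalid SMARTS: {smarts}")
    hits = {}
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:  # skip entries that RDKit cannot parse
            continue
        # GetSubstructMatches returns a tuple of atom-index tuples, one per match
        matches = mol.GetSubstructMatches(pattern)
        if matches:
            hits[smiles] = len(matches)
    return hits


# Toy library: benzene, 2-methylthiazole and 2,2'-bithiazole
library = ["c1ccccc1", "Cc1nccs1", "c1csc(n1)-c1nccs1"]
hits = count_matches(library, "c1cscn1")  # aromatic thiazole ring
print(hits)
```

Counting the matches per molecule is what makes it possible to report compounds that contain the query substructure more than once.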

fpsim: fingerprint similarity search

The fingerprint generation relies on the RDKit framework. All fingerprints currently implemented in the RDKit (Morgan, RDKit, Topological Torsion and Atom Pairs fingerprint and MACCS keys) are supported and different similarity measures (Tanimoto, Tversky, Cosine, Dice, Sokal, Russel, Kulczynski and McConnaughey similarity) can be used.
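
For orientation, four of these measures are written out below for fingerprints represented as Python sets of on-bit indices. This is a plain-Python sketch using the standard textbook definitions, not the RDKit implementation; the function names, the toy bit sets and the choice to return 0.0 for two empty fingerprints are assumptions of this example.

```python
import math


def tanimoto(a, b):
    """Shared bits divided by bits set in either fingerprint."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def dice(a, b):
    """Twice the shared bits divided by the total number of set bits."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def cosine(a, b):
    """Shared bits divided by the geometric mean of the bit counts."""
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


def tversky(a, b, alpha=0.5, beta=0.5):
    """Asymmetric measure; alpha and beta weight the bits unique to a and b."""
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


query = {1, 5, 9, 42}
candidate = {1, 5, 7, 42, 100}
for name, func in [("Tanimoto", tanimoto), ("Dice", dice), ("Cosine", cosine)]:
    print(name, round(func(query, candidate), 3))
```

With alpha = beta = 1 the Tversky index reduces to the Tanimoto coefficient, and with alpha = beta = 0.5 it reduces to the Dice coefficient.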

shape: shape-based screening

Several functionalities of RDKit were combined to perform a screening based on a compound’s molecular shape (Fig. 3). First, conformers are generated (RDKit ETKDGv3 and MMFF94 force field) for 2D query structures. Conformers for database compounds can be generated using the preparedb functionality. Then, the conformers of each query molecule are aligned to all conformers of each database molecule with the RDKit Open3DAlign functionality, using either MMFF94 force field parameters or Crippen atomic logP contributions (user-defined). In the next step, the shape similarity is calculated for every conformer pair (TanimotoDist, TverskyShape or ProtrudeDist) and the most similar conformer pair for every query/database molecule pair is selected (RDKit rdShapeHelpers). For the selected conformer pair, a 3D pharmacophore fingerprint is generated (RDKit Pharm2D) and the fingerprint similarity is calculated. By default, a combined score (combo score), the average of shape similarity and 3D fingerprint similarity, is used to rank the database molecules. The intended use case of the shape screening mode is to screen a database of compounds with multiple conformers (prepared e.g. using the preparedb functionality of VSFlow) with a query ligand in a single, bioactive conformation, e.g. from the PDB.

Fig. 3 Different steps and RDKit functionalities which were combined to perform a screening based on pharmacophore alignment and shape similarity
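
The selection and scoring logic of the last two steps can be sketched in plain Python as follows. The sketch assumes that the shape measure is a distance in [0, 1], as the name TanimotoDist suggests, so that the similarity is taken as 1 minus the distance, and that all per-pair values have already been computed. The function names, the dictionary layout and the numbers are hypothetical; alignment and fingerprint generation are not shown.

```python
def best_conformer_pair(shape_dist):
    """Pick the conformer pair with the smallest shape distance.

    shape_dist maps (query_conf_id, db_conf_id) -> distance in [0, 1].
    Returns the pair and its similarity, taken here as 1 - distance.
    """
    pair = min(shape_dist, key=shape_dist.get)
    return pair, 1.0 - shape_dist[pair]


def combo_score(shape_sim, fp_sim):
    """Average of shape similarity and 3D fingerprint similarity."""
    return (shape_sim + fp_sim) / 2.0


def rank_database(database):
    """Score every molecule by its best conformer pair, highest combo score first."""
    scores = {}
    for name, (shape_dist, fp_sim) in database.items():
        pair, shape_sim = best_conformer_pair(shape_dist)
        # the fingerprint similarity is only needed for the selected pair
        scores[name] = combo_score(shape_sim, fp_sim[pair])
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Made-up values: name -> (shape distances per pair, fingerprint similarity per pair)
database = {
    "mol_a": ({(0, 0): 0.40, (0, 1): 0.20}, {(0, 0): 0.90, (0, 1): 0.50}),
    "mol_b": ({(0, 0): 0.30}, {(0, 0): 0.80}),
}
ranking, scores = rank_database(database)
print(ranking)
```

Note that the conformer pair is chosen on shape alone in this sketch, so a pair with a higher fingerprint similarity but a poorer shape overlap (as for mol_a) does not contribute to the score.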

managedb: manage databases

The mode managedb is a convenience tool to update and manage compound databases which are integrated into VSFlow. A detailed description can be found in the VSFlow wiki [26].

Results and discussion

In the following section, the intended usage of VSFlow, including some example commands, is presented. A detailed description of the multiple possibilities to use VSFlow along with specific examples can be found in the VSFlow GitHub wiki [26].

In order to demonstrate the three main functionalities of VSFlow together with both its versatile input and output formats, we took the tyrosine-kinase inhibitor dasatinib as query molecule. As database, an SD file of the FDA-approved drugs generated from the ZINC database was used, comprising over 1600 molecules [14]. This database is also available in our GitHub repository.

Substructure search

For the substructure search, a SMARTS representation of the thiazole moiety of dasatinib was taken as input to see how many other drugs contain that specific group. The search returned 36 hits containing the thiazole group (one of them, of course, dasatinib itself); three of these molecules, namely cefditoren, cobicistat and ritonavir, even contain two thiazole groups. A PDF file (supporting information) was generated displaying a table of the hits with their 2D structures, the substructure match highlighted in red, and the information on each hit (e.g. ID, SMILES; Fig. 4). Note that a PDF can only be generated in addition to an SDF, Excel or CSV file.

figure e

Fig. 4 Exemplary page of the PDF file generated after substructure search. The left column shows the hits with the substructure matches highlighted in red, the right column the ID of the hits as well as the SMILES and the query SMARTS

Fingerprint similarity

For the fingerprint similarity function fpsim, a SMILES input of the molecule was used with default parameters, i.e. an FCFP4-like Morgan fingerprint with 2048 bits and radius 2, for which the Tanimoto coefficient was calculated. A PDF file and an Excel file were selected as output formats. The simmap parameter generates a similarity map that visualizes the contribution of individual atoms to the similarity between the molecules in the database and dasatinib (Fig. 5) [27].

figure f

Fig. 5 Exemplary page of the PDF file generated after fpsim search. The fingerprint similarity (FCFP4-like Morgan 2048 bits) of the molecules with the query molecule dasatinib is visualized in the left column, the right column shows IDs of the molecule as well as the search parameters and the calculated Tanimoto similarity
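
The default settings described above correspond roughly to the following RDKit calls. This is a hedged sketch rather than the VSFlow code: the helper names and the toy library are hypothetical, the dasatinib SMILES was written out for this example, and `GetMorganFingerprintAsBitVect` is used for compatibility with older RDKit releases (newer releases offer `rdFingerprintGenerator` for the same purpose and may emit a deprecation warning here).

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

DASATINIB = "Cc1nc(Nc2ncc(s2)C(=O)Nc2c(C)cccc2Cl)cc(n1)N1CCN(CCO)CC1"


def fcfp4_like(smiles, n_bits=2048):
    """Feature-based Morgan fingerprint of radius 2, folded to n_bits."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"cannot parse SMILES: {smiles}")
    return AllChem.GetMorganFingerprintAsBitVect(
        mol, 2, nBits=n_bits, useFeatures=True
    )


def rank_by_tanimoto(query_smiles, library):
    """Return (similarity, smiles) tuples, most similar first."""
    query_fp = fcfp4_like(query_smiles)
    scored = [
        (DataStructs.TanimotoSimilarity(query_fp, fcfp4_like(smiles)), smiles)
        for smiles in library
    ]
    return sorted(scored, reverse=True)


# Toy library: benzene, 2-methylthiazole and the query itself
ranked = rank_by_tanimoto(DASATINIB, ["c1ccccc1", "Cc1nccs1", DASATINIB])
for similarity, smiles in ranked:
    print(f"{similarity:.3f}  {smiles}")
```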

Shape similarity

In order to perform a shape screening, a new database containing up to 20 conformers per compound was generated with the -c argument, because the original database only had one conformer per compound.

figure g

Since that is a rather resource-intensive step, multiprocessing was carried out with the help of the -np parameter. The following shape search, also multiprocessed, was then done with the previously prepared vsdb pickle file using the instance coordinates of dasatinib in complex with tyrosine protein kinase ABL1 (PDB: 2GQG).

figure h

More than half of the top 10 hits were other kinase inhibitors. By default, the shape functionality creates two SD files, one with the query molecule (shape_1_query.sdf) and one with the hits (shape_1.sdf). Additionally, a PyMOL session file was generated (--pymol parameter) so that the aligned structures could be visually inspected directly (Fig. 6).

Fig. 6 Screenshot from the PyMOL session file generated after shape similarity screening. By default, the first ten hits (one of them shown here in blue) are aligned with the query molecule dasatinib (green)

The RMSD spread of the conformer generation process (ETKDGv3 followed by MMFF94 minimization) is given in Fig. 7. It shows a clear upward trend: the more rotatable bonds, the larger the RMSD.

Fig. 7 RMSD spread of the conformer generation process (ETKDGv3 followed by MMFF94 minimization) for the search of the bioactive conformation (Platinum data set)

Runtime performance

To give the user an idea of the expected runtime performance, we performed a substructure and a 2D similarity search in the PDB and ChEMBL28 databases [12, 28]. We performed the searches on up-to-date standard notebook hardware, namely a 12th Gen Intel(R) Core(TM) i7-12700H with 2.70 GHz and 20 cores and 32 GB RAM running Windows 11. To get an idea of the performance on your own system, you may execute the following commands accordingly. Both the ChEMBL and the PDB database can be downloaded and prepared directly within VSFlow:

figure i

With the above calls, the PDB and ChEMBL databases are downloaded into VSFlow and 2048-bit ECFP4 fingerprints are generated for each compound and stored within the output vsdb file. Preparation of the PDB database (containing 36,796 unique compounds as of 22/05/2022) took 11 s on our system, preparation of the ChEMBL28 database (2,066,377 compounds) took 511 s. We then performed a substructure and a similarity screening using a SMILES as query, once in single-core mode and once on 6 cores:

figure j

Table 1 summarizes the overall runtime for each call, i.e. it includes the loading time for the database file, the substructure or similarity search and the generation of the output file.

Table 1 Runtime performance of substructure and similarity search on 12th Gen Intel(R) Core(TM) i7-12700H with 2.70 GHz and 20 cores and 32 GB RAM running Windows 11

Virtual screening performance

To give the user an idea about the performance of the tool in virtual screening practice, i.e. whether it can identify active compounds, we performed some basic simulated screenings using the maximum unbiased validation (MUV) dataset [29]. The MUV dataset is based on PubChem bioactivity data and consists of 17 targets, each with 30 actives and 15,000 decoys. Actives and decoys are chosen on the basis of confirmatory and primary screens, which makes the dataset very difficult for virtual screening methods. We performed sample screenings based on 2D fingerprint and 3D shape similarity (modes fpsim and shape). The general performance of 2D fingerprints implemented in RDKit has been studied extensively before, with the MUV dataset being part of a larger evaluation set [30]. We adapted a simplified version of the workflow described before by Rohrer [29] and Riniker [30]. In short, for each of the 17 subsets in the MUV dataset, one of the 30 active compounds was selected as query molecule and the remaining 29 actives were pooled together with the 15,000 decoys and used as validation set. This query/validation split was done for all 30 actives. For the resulting 30 query/validation test splits per subset, the virtual screening performance was measured by the area under the receiver operating characteristic (ROC) curve (AUC, example curve shown in Fig. 8) and the mean value was calculated for each subset (mean AUC). The screening consisted of two steps: (1) generation of a vsdb database with standardized molecules and pre-computed fingerprints or conformers for the validation set; (2) 2D or 3D similarity screening of the validation set against the query molecule.
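
The split and scoring scheme described above can be sketched in plain Python as follows. The AUC is computed here via the pairwise (Mann-Whitney) formulation, which is equivalent to the area under the ROC curve. The function names, the toy fingerprints (sets of on-bit indices) and the Tanimoto helper are assumptions of this example and do not reproduce the evaluation scripts used for the MUV dataset.

```python
def tanimoto(a, b):
    """Tanimoto coefficient for fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def roc_auc(scores, labels):
    """Fraction of active/decoy pairs ranked correctly; ties count one half."""
    actives = [s for s, is_active in zip(scores, labels) if is_active]
    decoys = [s for s, is_active in zip(scores, labels) if not is_active]
    if not actives or not decoys:
        raise ValueError("need at least one active and one decoy")
    wins = 0.0
    for a in actives:
        for d in decoys:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(actives) * len(decoys))


def leave_one_out_splits(actives, decoys):
    """Yield (query, validation set, labels), using each active once as query."""
    for i, query in enumerate(actives):
        remaining = actives[:i] + actives[i + 1:]
        labels = [True] * len(remaining) + [False] * len(decoys)
        yield query, remaining + decoys, labels


def mean_auc(actives, decoys, similarity):
    """Average AUC over all query/validation splits of one subset."""
    aucs = []
    for query, validation, labels in leave_one_out_splits(actives, decoys):
        scores = [similarity(query, mol) for mol in validation]
        aucs.append(roc_auc(scores, labels))
    return sum(aucs) / len(aucs)


# Toy subset: three actives sharing bits 1 and 2, two unrelated decoys
actives = [{1, 2, 3}, {1, 2, 4}, {1, 2, 5}]
decoys = [{7, 8, 9}, {1, 8, 9}]
print(mean_auc(actives, decoys, tanimoto))
```

A mean AUC of 0.5 corresponds to a random ranking, and a value of 1.0 means that every remaining active was ranked above every decoy in all splits.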

The results for 2D similarity screening with various descriptors are summarized in Fig. 9. They follow, in general, the trend observed by Riniker et al. for 2D fingerprints on the MUV dataset [30]. For some targets, a significant enrichment of actives is observed (e.g. mean AUC = 0.74 for the ECFP3 fingerprint for target FactorXIa [MUV_846]), whereas for other targets no enrichment could be observed based on simple 2D similarity calculations.

Fig. 10 summarizes the results for the 3D shape-based virtual screenings. For most MUV subsets, the best performance is observed when the combo score is used for result ranking. However, for MUV_737 (estrogen receptor alpha) and MUV_832 (cathepsin G), scoring with the 3D fingerprint yields a better overall enrichment.

Fig. 8 Example of a receiver operating characteristic (ROC) curve obtained for a query/validation test split after a 2D similarity screening with ECFP2 fingerprint. The MUV subset MUV_548 was the validation set, the query was compound MUV_548_A_5. The area under the curve (AUC) is 0.744. FPR = false positive rate, TPR = true positive rate

Fig. 9 Results of virtual screening validation with the MUV dataset for 2D fingerprint similarity. The expectation of mean AUC of 0.5 for random rankings is indicated by the blue dashed line

Fig. 10 Results of virtual screening validation with the MUV dataset for 3D shape-based screenings. The expectation of mean AUC of 0.5 for random rankings is indicated by the blue dashed line


VSFlow is a versatile command-line tool to perform ligand-based virtual screenings in large compound databases on the basis of the RDKit cheminformatics framework. It allows users to perform a substructure search, a 2D fingerprint-based and a 3D shape-based similarity search based on the respective functionalities implemented in RDKit. Screenings can easily be parallelized across multiple cores and the screening results can be directly visualized as a PDF or PyMOL file. The integration of VSFlow in existing virtual screening setups is straightforward because the entire code is open source.

Availability and requirements

  • Project name: VSFlow - Virtual Screening Workflow

  • Project home page:

  • Operating system(s): Platform independent

  • Programming language: Python

  • Other requirements: Anaconda or Miniconda

  • License: MIT

  • Any restrictions to use by non-academics: no.