Background

Genes working cooperatively in a metabolic pathway are often physically co-localized in prokaryotic and fungal genomes. These gene clusters are commonly observed in specialized metabolism involved in ecological adaptations, such as nutrient utilization and production of virulence factors. In particular, biosynthetic gene clusters (BGCs) that code for specialized metabolites have gained significant interest due to their major role in modern society as a source of pharmaceutical drugs (e.g. antibiotics) and crop protection chemicals [1, 2]. These loci not only contain genes responsible for biosynthesis but often include auxiliary regions coding for regulatory and transporter proteins [2, 3]. Using signature genes and machine-learning-based methods, several computational frameworks have been developed to effectively detect hypothetical BGCs from genomic data, such as ClusterFinder, PRISM, DeepBGC, and antiSMASH [4,5,6,7]. With these mature pipelines and the increase in publicly available genomes, a vast number of BGCs, both experimentally verified and hypothetical, have been catalogued in several databases. These include MIBiG, antiSMASH-DB, BiG-FAM, ARTS-DB, and IMG–ABC [8,9,10,11,12]. Unfortunately, much of this data remains unannotated. For instance, as little as 0.3% of the ~400,000 BGCs in IMG–ABC v5 are experimentally validated. Comparative genomic analysis can shed light on the functions of BGCs and their underlying genes. However, accessible online tools that allow scientists to perform custom comparative genomic analyses are lacking.

Methods for homology grouping, search, and visualisation of gene clusters are essential to effectively leverage the available public resources. While tools such as BiG-SCAPE, BiG-SLiCE, MultiGeneBlast and cblaster aid in gene cluster analysis, these demand local computational resources or require command-line experience [13,14,15,16]. Given this technical barrier, there is a need for a user-friendly and accessible platform for performing these analyses. Additionally, downstream methods for interpreting these results are often required. Visualisation and comparative genomic tools such as clinker and CORASON are capable of highlighting synteny or evolutionary relationships between BGCs; however, these also require expertise to operate and are not easily connected to homology search results [13, 17]. To remedy this problem and provide an accessible, “BLAST-like” web server for gene clusters, we present CAGECAT (the CompArative GEne Cluster Analysis Toolbox).

The CAGECAT web server enables researchers to execute a full gene cluster analysis pipeline using customizable BLAST searches on up-to-date genomic databases. The service provides seamless connections between the search and visualisation modules, enabling execution, inspection, and fine-tuning of relevant search results. While some multi-gene search portals exist, such as ClusterScout and antiSMASH-DB, these only provide for model-based searching (e.g. Pfam) on predefined genome datasets, which often lag behind rapidly growing public genomic databases [9, 18]. In addition to providing more up-to-date results, leveraging BLAST homology allows for refined control compared with model searches (e.g. identity and coverage), which can lead to more specific matches that aid in annotation, taxonomic distribution, or gene cluster evolution. Furthermore, with the interconnection of modules, a user can accelerate result curation and downstream analysis, e.g. using gene neighbourhood estimation output to adjust intergenic distance thresholds to obtain more relevant matches. To our knowledge, we present the first free and publicly available web server for accelerated curation of homologous gene clusters with integrated downstream interpretation. By broadening accessibility of gene cluster analysis methods we hope this will lead to accelerated analysis and annotation of BGCs and contribute to the general knowledge of their subsequent products.

Implementation and available tools

The aim of CAGECAT is to provide a platform that seamlessly connects gene cluster analysis tools in an accessible web server for search and interpretation of results. To provide this service, CAGECAT implements a queue system that allows parallel job submissions, supported by the Python ‘rq’ library and a Flask web server (see Additional file 1). The search module leverages the cblaster pipeline, which utilises remote BLAST searches via NCBI’s servers as well as accelerated local Hidden Markov Model (HMM) based searches. Besides rapid similarity searches of entire BGC regions, cblaster provides several functions for gene neighbourhood estimation (GNE), sequence extraction, and visualisation (see Gilchrist et al. for a detailed description of methods) [16]. The clinker pipeline is currently used for the visualisation module, which provides automated cluster alignment and homology annotations. CAGECAT has been designed to provide rapid interoperability between these functions, where homologous clusters of interest can be selected for use in subsequent analysis. A graphical summary of tool interoperability is given in Fig. 1.

Fig. 1

Interoperability scheme of implemented functionality on CAGECAT. Blue outlined rectangles indicate entry points. Arrows indicate available downstream analyses from a module. Currently, a cblaster search/recompute job can be used for every downstream module, except that a recompute job cannot itself be recomputed. The clinker tool has no downstream analyses. For example, a possible workflow could be: cblaster search to cblaster recompute to cblaster plot clusters to selective clinker visualisation. This allows for fine-grained control of relevant matches for final visualisation and greatly reduces user processing time

Databases for hidden Markov model (HMM) searches

Searches for homologous gene clusters based on HMM profiles using cblaster require cblaster-generated HMM databases. Genus-specific Pfam databases were generated as detailed in supplemental methods, resulting in 70 genera with 10 or more genomes for fungi, and 43 genera with 50 or more genomes. A custom script to fetch representative and reference genomes of prokaryotes and fungi was made using NCBI’s e-search utilities [19]. To keep CAGECAT freely accessible and to limit storage requirements, researchers who wish to use custom HMM databases must use the command-line version of cblaster or a local installation of CAGECAT.
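As an illustration of how such a genome query could be composed, the following minimal sketch builds an ESearch request URL for the NCBI Assembly database. It is not the custom script described above: the function name, the genus used, and the exact search term (in particular the "reference genome" and "representative genome" filters) are assumptions made for this example, and no network request is sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Public ESearch endpoint of the NCBI E-utilities.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(genus, retmax=500, api_key=None):
    """Build an ESearch URL for reference/representative assemblies of a genus.

    The search term is an assumption for illustration, not the authors' query.
    """
    term = (
        f'"{genus}"[Organism] AND '
        '("reference genome"[filter] OR "representative genome"[filter])'
    )
    params = {"db": "assembly", "term": term, "retmax": retmax, "retmode": "json"}
    if api_key:
        # An API key raises the permitted request rate.
        params["api_key"] = api_key
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url("Streptomyces", retmax=100)
# Decode the query string again to inspect the search term that would be sent.
print(parse_qs(urlsplit(url).query)["term"][0])
```

The resulting URL can be fetched with any HTTP client; the returned identifiers would then still need to be resolved to downloadable genome records.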

Job management

CAGECAT manages job submissions through a queue system, which processes jobs in a parallelizable first-in-first-out manner. Remote BLASTp queries are submitted to the NCBI API, which leverages a scalable infrastructure allowing for multiple simultaneous searches (~10 requests/sec with an API key). By default, up to 15 jobs can be run in parallel to ensure stability and throughput. Upon job execution, the job command is constructed with the user-defined values of the input parameters and the appropriate pipelines are executed via Python. All output files are then stored under a uniquely generated job ID. See supplemental methods for further technical details.
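The job flow described above can be sketched with the Python standard library alone. This is a simplified stand-in and not CAGECAT's implementation: it uses a thread pool instead of the 'rq' library, the flag layout produced by build_command is hypothetical, and the "pipeline" merely writes the constructed command to a folder named after the job ID.

```python
import tempfile
import uuid
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

MAX_PARALLEL_JOBS = 15  # default number of parallel jobs stated in the text


def build_command(tool, params):
    """Assemble an argument list from user-defined parameters (hypothetical flags)."""
    args = [f"--{key}={value}" for key, value in sorted(params.items())]
    return [tool, *args]


def run_job(job_id, tool, params, root):
    """Stand-in for pipeline execution: store the command in a per-job folder."""
    job_dir = Path(root) / job_id
    job_dir.mkdir(parents=True)
    command = build_command(tool, params)
    (job_dir / "command.txt").write_text(" ".join(command))
    return job_id


def submit_all(jobs, root):
    """Submit jobs in arrival order; the pool starts them first-in-first-out."""
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_JOBS) as pool:
        futures = [
            pool.submit(run_job, uuid.uuid4().hex, tool, params, root)
            for tool, params in jobs
        ]
        # Collect the job IDs in submission order.
        return [future.result() for future in futures]


root = tempfile.mkdtemp(prefix="jobs_")
job_ids = submit_all(
    [("search", {"min_identity": 30}),
     ("recompute", {"gap": 20000, "min_identity": 50})],
    root,
)
```

Keying every output folder by a random job ID keeps results of concurrent jobs separate and gives each user a stable handle for retrieving them later.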

Results and user interface

Input and output

Two entry points for queries are currently implemented in CAGECAT for either gene cluster search via cblaster (search module) or visualisation via clinker (visualisation module). Input and output for other implemented modules are shown in Table 1.

Table 1 Current entry points of CAGECAT and their inputs and outputs

The search module allows for local files in either GenBank or FASTA format (protein sequences) to be uploaded and processed by the cblaster pipeline. Additionally, NCBI accession numbers can be used to submit a search query on the NCBI database, which can be combined with local searches using HMM profiles in predefined databases on CAGECAT. The input page (Additional file 1: Figure S1) also contains optional parameters for selection of remote databases, search behaviour, and clustering of results. For the visualisation module, users can upload several GenBank files or directly use outputs from the search module.
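To make the role of the clustering parameters more concrete, the sketch below groups hits on a single scaffold by intergenic distance and keeps only groups that cover enough distinct query proteins. It illustrates the general idea only and is not cblaster's code: the Hit class, the parameter names max_gap and min_unique, and the coordinates are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str  # name of the query protein that matched
    start: int  # start coordinate of the hit gene on the scaffold
    end: int    # end coordinate of the hit gene on the scaffold


def cluster_hits(hits, max_gap, min_unique):
    """Group hits whose intergenic distance is at most max_gap."""
    clusters, current = [], []
    for hit in sorted(hits, key=lambda h: h.start):
        # Start a new group when the distance to the previous hit is too large.
        if current and hit.start - current[-1].end > max_gap:
            clusters.append(current)
            current = []
        current.append(hit)
    if current:
        clusters.append(current)
    # Keep only groups that cover enough distinct query proteins.
    return [c for c in clusters if len({h.query for h in c}) >= min_unique]


hits = [
    Hit("A", 100, 900), Hit("B", 1500, 2400), Hit("C", 3000, 3900),
    Hit("A", 80000, 80900), Hit("B", 81500, 82000),
]
clusters = cluster_hits(hits, max_gap=20000, min_unique=3)
print([[h.query for h in c] for c in clusters])  # [['A', 'B', 'C']]
```

Raising max_gap merges distant hits into one candidate cluster, whereas lowering min_unique also reports partial clusters such as the second A/B pair in this toy example.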

After completion of remote NCBI searches, users are presented with a cluster heatmap, which displays the absence/presence of each query protein sequence across the genomic hits (Fig. 2A). As in the original cblaster, the results are sorted and colored based on BLAST similarity and number of matching proteins to the query cluster for rapid identification and comparison of homologous gene clusters across genomes. For the visualisation module, clinker will generate interactive gene cluster comparison figures with links drawn between similar genes on neighbouring clusters and shaded based on sequence identity (Fig. 2B). Further details of these modules can be found at https://cagecat.bioinformatics.nl/tools/explanation and several example case studies for the cblaster output can be found in Gilchrist et al.
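The structure behind such a heatmap can be pictured as an absence/presence table with one row per hit cluster and one column per query protein. The sketch below builds such a table and orders the rows by the number of matched queries and then by summed identity. The data, the function name and the exact sort key are assumptions for illustration; cblaster's own ordering may differ.

```python
def presence_matrix(queries, clusters):
    """Rows are clusters, columns are query proteins; 0.0 marks an absent query."""
    rows = []
    for name, hits in clusters.items():
        rows.append((name, [hits.get(q, 0.0) for q in queries]))
    # Best first: more matched queries, then higher summed percent identity.
    rows.sort(key=lambda r: (sum(v > 0 for v in r[1]), sum(r[1])), reverse=True)
    return rows


queries = ["A", "B", "C"]
# Hypothetical percent identities of the best hit per query in each cluster.
clusters = {
    "genome1": {"A": 95.0, "B": 80.0},
    "genome2": {"A": 60.0, "B": 55.0, "C": 70.0},
    "genome3": {"A": 99.0, "B": 90.0},
}
rows = presence_matrix(queries, clusters)
for name, values in rows:
    print(name, values)
```

In this toy example the cluster containing all three query proteins is listed first even though its individual identities are lower, which mirrors the emphasis on the number of matching proteins described above.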

Fig. 2

Example output of CAGECAT’s entry points. Both modules create an interactive HTML visualisation which is displayed on each output page. A cblaster search: hit clusters are shown in a dendrogram (based on identity to query sequences). A darker tint of blue represents a higher percentage identity of the query in the output cluster; B clinker visualisation: genes within a gene cluster are color-coordinated. Similar genes found in multiple clusters have links drawn between them and are shaded based on sequence identity

Features and interoperability

Users can download job results to their local computer within 30 days, and output HTML files are displayed in-browser, allowing for interactive inspection of results. The search module output allows for manual gene cluster selection to further curate results, which can be directly exported as GenBank sequences. To accelerate analysis, CAGECAT provides interoperation between results and the available modules. Selections of output from the search module can be directly used as input for downstream analysis (e.g. to selectively visualise some results) or to recompute a search using different parameters (Fig. 3). Notably, when genomic regions from the search module are used for analysis in the visualisation module, the visualisation includes all genes present within each genomic region, including those that were not specified in the search query.

Fig. 3

Post-job execution screen for selective downstream analysis. 1: buttons to download results and save the current webpage to the browser’s bookmarks; 2: available downstream analyses for the current analysis. Selected clusters and/or queries are temporarily saved when navigating to a downstream module; 3: manual selection of clusters for downstream analyses. Clusters/queries can be selected by moving them to the selected field using the shown buttons. Available for cblaster search, recompute and plot clusters modules

Runtime and scalability

Remote search times are largely dependent on NCBI services which cannot be definitively benchmarked due to dependency on service traffic. However, processing of 346 queries over the 5-month user testing period showed an average search completion time under 8 min. Other functions such as clinker visualisation, recompute, gene cluster neighbourhood estimation, and cluster extraction all showed negligible processing time under 30 s (Additional file 1: Table S1).

Conclusions and future directions

With CAGECAT, we aim to lower the technical barrier to execute gene cluster analysis. Downstream analyses can be rapidly performed using the results of a previously executed job, which accelerates curation and comparative visualization. This service enables a quick search of whole gene cluster sequences against NCBI non-redundant or RefSeq databases that can be confined to a selected genus. Currently, two entry points exist to start an analysis on CAGECAT: (I) finding homologous gene clusters using a query cluster and the cblaster search module, and (II) a visualisation of gene clusters using a set of query clusters and the clinker module. CAGECAT does not impact or interfere with the analysis capabilities of the implemented tools and acts as a bridge to allow for rapid retrieval of homologous gene clusters from continually updated public databases. We foresee CAGECAT being used by a wide audience to easily uncover homologous BGCs and provide publication-quality visualisations without the need for computational resources or programming expertise. The service is also built to be extensible so that additional downstream analyses can be connected in future versions. Suggestions and comments sent via the contact page will be carefully considered during development. Furthermore, CAGECAT is also useful for comparative analysis and discovery of gene clusters beyond those that encode the production of specialized metabolites, such as xenobiotic degradation pathways [20]. Considering the remote database has no restriction to any particular taxa, this service can thus be used for general homology searches beyond those detailed in this manuscript on a variety of genomes (e.g. human, mouse). Inter-taxa results are also possible with lower homology thresholds set in the advanced options.

With this web server, we aim to accelerate comparative analysis of gene clusters and provide an easy-to-use interface to help uncover clues for further study of BGCs encoding useful specialized metabolites as well as a starting point for investigating gene cluster evolution.

Availability

Project name: Comparative Gene Cluster Analysis Toolbox (CAGECAT).

Project home page: https://cagecat.bioinformatics.nl

Operating system(s): Linux / Platform independent via Docker.

Programming language: Python.

Other requirements: Python 3.8, Docker.

License: MIT.

Source code: https://github.com/malanjary-wur/CAGECAT