Background

Characterization of bioactive molecules produced by free-living microorganisms has been very important in recent years because of their biotechnological applications. It is well known that the overwhelming majority of free-living microorganisms are not capable of being grown in laboratory conditions [1], which is a bottleneck to the identification and isolation of bioactive compounds. Thus, an alternative to search for new compounds is the well-established method of WMS, where the nucleic acids of the microbial community are extracted and sequenced directly from environmental samples [2]. Thus, genes involved in the synthesis of peptides or non-peptides bioactive compounds can be assessed. The main bottleneck lies in the development of user-friendly tools that allow the user to analyze a large amount of data in a simple and interactive way.

Several genes and molecules were prospected by WMS such as amylolytic or cellulolytic enzymes [3], antimicrobial compounds [4], antibiotic resistance genes (ARGs) [5,6,7] and bacteriocins [8]. Bacteriocins are small cationic thermostable bacterial peptides with a narrow spectrum of activity [9,10,11]. Unlike antibiotics, bacteriocins are produced by ribosomal activity and therefore are protease-sensitive. They are divided into three classes according to their synthesis mechanism.

Class I consists of peptides that after their translation, undergo structural changes. This class is also called lantibiotics. They have a molecular weight below 5 kDa and a size smaller than 28 amino acids [12, 13]. Class II is characterized by peptides that do not undergo post-translational modifications. They are larger than class I bacteriocins and have a molecular weight below 10 kDa [14]. Class III is composed of peptides with a molecular weight higher than 30 kDa. Bacteriocins of this class have a mechanism of action different from the other two classes, eliminating bacterial cells through cell wall hydrolysis [12, 15, 16].

A variety of software has been developed to search for ARGs or secondary metabolites such as nonribosomal peptide synthase (NRPS) or polyketide synthase (PKS) in WMS data [6, 17]. However, none of this software is focused on the prospecting of bacteriocins. Anti-SMASH [18] is an excellent tool to analyze genomes while RiPPER [19] works better for pan-genome data. BAGEL web tool (BActeriocin GEnome mining tooL) [17] is one of the first tools developed for the identification of peptides and bacteriocins in genome data. However, the tool has a maximum size for the input file, making difficult the analysis of WMS data.

In this article, we present BADASS software (BActeriocin-Diversity ASSessment Software), an automated pipeline with an intuitive graphical interface that allows users to analyze the diversity of bacteriocins using WMS raw data. Diversity measurement is based on the abundance and richness of the three bacteriocin classes currently described. The software is available at https://sourceforge.net/projects/badass/.

Implementation

Pipeline

The pipeline of BADASS (Fig. 1) starts with the automatic loading of bacteriocin sequences to the database. It is worth noting that this process needs to be executed only on the first use of the software. The input file consists of a WMS sequencing sample in FASTQ format. After saving the project the following steps are performed.

  1. (a)

    Quality assessment The user can choose to evaluate the raw data with a boxplot chart that correlates the Phred score of a base (y axis) with base position (x axis). This is an optional step that helps users to decide about the quality filter values that will be used in the next step. A boxplot with the result of the FastQC analysis is produced at this stage and displayed to the user.

  2. (b)

    Trimming and quality filter Raw data is trimmed to remove bases at the end of the reads with a Phred score below the cut-off value provided by the user. Subsequently, sequences are filtered according to parameters such as alignment score and e-value. The Fastx Toolkit software is used in this step.

  3. (c)

    Parser fastq2fasta The trimmed and filtered file is converted into FASTA format.

  4. (d)

    Mapping against the bacteriocin database In order to identify bacteriocins, we adapted a search method used in several works [20, 21] which consists of: Firstly, a database is built using non-redundant BAGEL4 sequences [22]. Subsequently, the BLAST+ tool [22, 23] is used to compare the translated reads against the BAGEL4 database with blastx. The best hit for each read identified as bacteriocin is retrieved. The user can define an e-value cut-off for the homology analysis.

  5. (e)

    Mapping against the 16S rRNA SILVA database The same file of the previous step is used to align the reads against the SILVA database using DIAMOND [24]. Two files in.csv format are generated. The first contains the list of subject nucleotide sequences with their respective identity values. The other file contains the best hit for each query sequence based on an identity cut-off value provided by the user. Cut-off values are adjustable in the graphical interface of BADASS.

Fig. 1
figure 1

Diagram of the software pipeline

The number of reads identified as bacteriocins and 16S rRNA are used to calculate the richness and abundance of bacteriocins in the WMS dataset as mentioned later.

Programming language and database

BADASS was developed in JAVA (https://www.oracle.com) and used the Maven tool (https://maven.apache.org/) to build and manage the project. Maven was used due to the automated management and generation of the JAR package containing the software dependencies. Swing library was used to produce the graphical interface. The database management system used to control the steps and manage the project was SQLite v.3 (https://www.sqlite.org/).

Data source and software validation

The software validation was performed using four whole-metagenome shotgun sequencing datasets. Samples were obtained in the Tucuruí Hydroelectric Power Plant water reservoir submitted in EBI database under the accession numbers ERS1560860, ERS1560861, and ERS1562591 [8] and a sample obtained from Unai's Hot Spring from the ENA (European Nucleotide Archive) database with the accession number PRJEB8864. The following parameters were used: Quality threshold: 20, minimum length: 100, minimum quality score to keep: 16, minimum percent: 80%, e-value: 10, threads: 6, identity: 50.

Quality assessment and BLAST

BADASS uses the statistical platform R v.4 (https://www.r-project.org) for quality assessment of the raw data, through the FastQC package (https://github.com/kassambara/fastqcr). Reads are trimmed and quality filtered using the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/). The user is able to adjust the parameters of quality filtering in the graphical interface of BADASS. BLAST v.2.0.9-3 and Diamond v.2.0.7.145 software are used to map the reads against the BAGEL4 [22, 25] and SILVA [26] databases, respectively. BAGEL4 was updated and now has about 500 Ribosomally synthesized and post-translationally modified peptides (RiPPs) (class 1), 230 non-modified bacteriocins (class 2), and 90 bacteriocins with more than 30 kDa (class 3). RippMIner [27] and the data repository MIBiG [28] were used to build the BAGEL4 database.

Abundance analysis

Diversity of bacteriocins was analyzed in terms of abundance and richness. In order to calculate the abundance of bacteriocins we adapted the formula proposed formula in studies involving the search of abundance of resistance genes in WMS data. [20]. Where: (1) n is the amount of bacteriocins that were found in sample; (2).Nbacteriocin sequences is the number of reads that mapped to a specific bacteriocin; (3) Tread is the average size of reads; (4) Tbacteriocin is the average size of the bacteriocin; (5) N16S rRNA sequences is the number of reads that mapped to 16S rRNA sequences; and (6) T16S rRNA sequence is the average size of the 16S rRNA sequence.

$$Abundance = \mathop \sum \limits_{1}^{n} \frac{{N_{bacteriocinsequences} \times \frac{{T_{read} }}{{T_{bacteriocin} }}}}{{N_{16SrRNAsequences} \times \frac{{T_{read} }}{{T_{16SrRNA} }}}}.$$

Workstation

Analyses were performed in a Desktop equipment Intel® Core™ i7-10510U CPU @ 1.80 GHz with 8 processing cores, 16 GB of RAM memory, and tests were run on Ubuntu 21.10, 64-bit, Windows 11 and macOS Ventura 13.0 operating systems.

Results and discussion

BADASS was developed using the BLAST and DIAMOND alignment tools to identify bacteriocin and 16S rRNA sequences in WMS raw data. The choice of tool can be defined in the BADASS GUI. Other studies have used similar methodologies for prospecting relevant genes such as ARGs [7, 20, 29, 30]. For example, ARGs-OAP is an online pipeline for antibiotic resistance genes detection in metagenomic data through similarity sequence analysis [5]. In addition to the homology search, BADASS calculates the abundance of each bacteriocin class by taking into account the size of the genes and the size of the reads produced by the sequencing library [22]. The number of reads identified as bacteriocins are compared to the number of reads identified as the gene of the 16S rRNA, which is present in a few copies per cell. This approach makes the size of the reads as well as the size of the genes not interfere with the analysis of abundance. Thus, this pipeline is a powerful tool for rapid and comprehensive evaluation of bacteriocin diversity using WMS raw data.

The main results of BADASS are the description of richness and the values of the abundance of bacteriocins using WMS raw data as input file in a simple and intuitive way. The software provides a set of adjustable parameters in the graphical interface. Users can also choose to process the samples using default parameters. In more detail, the results obtained by the software include: (1) quality assessment box plots of the raw data directly in the graphical interface or even in the results folder; (2) spreadsheets in.xlsx or.csv format containing information about the identified bacteriocins (richness) including their frequency (ratio between the number of reads identified as bacteriocin and the total number of reads in a sample) and abundance (calculated using the formula previously mentioned), organized by class; (3) bar plots of abundance based on the.csv files; (4) a.csv file containing the list of 16S rRNA sequences identified in the dataset including the value of percentage identity; (5) Trimmed.fastq and QV.fastq files containing the trimmed reads and the reads filtered by minimum size, respectively.

It is also possible to detect bacteriocin in genome sequences using other software. Table 1 presents several computational tools and databases developed to help in the identification of these antimicrobial peptides. The main features of each software or database are compared in the table. It is worth noting that BADASS is the only who has a graphical interface, supports WMS data, and performs diversity analysis (Table 1). BAGEL4 stands out for having one of the most complete databases containing a large number of annotated and experimentally verified bacteriocin sequences. In addition, the database is divided into three classes according to the genetic information and mechanism of action. Because of these features, BAGEL4 was used as a reference bacteriocin database in BADASS. BACTIBASE [30] is a database containing detailed information about the physicochemical properties of bacteriocins. This information allows a fast and accurate prediction of the structure–function relationship and possible target organisms of the antimicrobial peptides. Other relevant software includes BOA (Bacteriocin Operon Associator) [31] which uses Hidden Markov Models to predict bacteriocin clusters, Neubi [32] which identifies bacteriocins using a word embedding approach, and Anti-SMASH (Antibiotics and Secondary Metabolite Analysis Shell) which was launched in 2011 and is used not only for bacteriocin prediction but for a number of other secondary metabolites [18, 33].

Table 1 Comparison of the main features of software used for bacteriocin gene mining

The identification of bacteriocins, however, is still quite challenging due to the limited number of known and experimentally analyzed sequences. Choosing the most appropriate and up-to-date tool is essential for the search and identification of bacteriocin genes. BADASS is a user-friendly software, with a robust pipeline that starts with the quality assessment of the raw data and ends with the analysis of the richness and abundance of bacteriocins.

A pilot analysis was performed using four datasets with default parameters. The results of the dataset ERR1816708 are presented in Table 2. First column of the table shows the BAGEL4 accession number and name of the bacteriocins. Other columns correspond to frequency, abundance and class, respectively. Thus, users are able to identify the diversity of bacteriocins in the dataset. Additionally, complementary analyses such as taxonomic affiliation of the microbial community are important to determine the ecological context of the described bacteriocins [8].

Table 2 CSV file generated by the BADASS presenting the names and other information of the detected bacteriocins

Two bar plots were generated by the software containing an overview of the bacteriocin diversity. The first plot (Fig. 2) is designed based on a.csv file similar to Table 2. The plot presents the top ten most abundant bacteriocins in the dataset. Information about the bacterial species that commonly produce the peptides are presented in the legend. The second plot (Fig. 3) presents the abundance of bacteriocins by class. The best parameters for each study should be carefully chosen by the user according to it dataset characteristics. A variation in the results is expected since the parameters adjust the analysis performed by the software. The choice of parameters by the user will result in changes in the result, being able to restrict to stricter or looser parameters [34].

Fig. 2
figure 2

Output file of BADASS. Bar plots for the four datasets analyzed presenting the top ten most abundant bacteriocins

Fig. 3
figure 3

Output file of BADASS. Bar plots presenting the abundance of each class of bacteriocin in the four datasets analyzed

The time required for similarity search and post-alignment analysis has become a bottleneck as sequence costs decrease and the size of the datasets increases [35]. We also highlight that all the analysis, starting from the filtering of the raw data can be done in the BADASS pipeline. The software allows users to modify most of the parameters such as e-value, identity cut-off, and others.

Conclusions

In the environment, a countless number of microbial species coexist and, in order to succeed in colonize their ecological niches, many have developed mechanisms to eliminate other species through the production of antimicrobial molecules. In this chemical warfare, bacteriocins are narrow-spectrum antimicrobial peptides synthesized by ribosomal activity that are widely distributed in bacterial species. Thus, the development of computational tools to identify, classify and quantify bacteriocins in WMS datasets is of great importance for microbial ecology and biotechnology.

BADASS provides the user with a robust and automated computational tool with a simple and intuitive graphical interface, where the parameters can be adjusted by the user, allowing greater independence in the analysis of different samples. The integration of the software with the R statistical platform allows the generation of plots that helps in data interpretation. For those looking to prospect antimicrobial peptides in WMS raw data, BADASS is a powerful solution.

Availability and requirements

Project name: BADASS

Project home page: https://sourceforge.net/projects/badass/

Operating system(s): platform independent

Programming language: Java

Other requirements: e.g. Java 19.0.1 or higher

License: GNU GPL

Any restrictions to use by non-academics: license needed.