UpCoT: an integrated pipeline tool for clustering upstream DNA sequences of orthologous genes in prokaryotic genomes

UpCoT is a pipeline tool developed by automating the series of steps involved in prediction of cis-regulatory elements. UpCoT generates orthologs for each gene in target genome using bi-directional best blast hit against the reference genomes, then identifies potential orthologous transcriptional units using intergenic distance. Finally it generates the FASTA files containing upstream sequences of orthologous transcriptional units of each gene in target genome. The inputs of UpCoT are protein sequence files (*.faa), genome sequence files (*.fna) and gene co-ordinate files (*.ptt) for target and reference genomes. The clustered-upstream DNA sequences can be used by motif prediction tool, such as MEME, Bio-prospector, Gibbs motif sampler, MDscan for prediction of conserved DNA elements. We tested the performance of UpCoT by selecting the genome of Synechocystis sp PCC 6803 as the target and 13 different cyanobacterial genomes as reference. The clustered upstream sequences generated by UpCoT of groES, ycf24 and nirA were used for cis-regulatory element prediction. The results were consistent with the experimentally identified cis-regulatory elements. Therefore, UpCoT is a reliable and automated pipeline package for prediction of orthologs, orthologous transcriptional units, and orthologous upstream sequences of a selected prokaryotic genome. UpCoT can be downloaded from http://jssplab.uohyd.ac.in/upcot/.


Introduction
With the advent of fast and next generation automated DNA sequencing technologies, a number of microbial genomes have been sequenced during the past decade and the sequence information is available in various genome databases. Identification of cis-regulatory elements and the trans-acting factors of a sequenced genome is one of the major challenges to computational biologists for building a global gene regulatory network. Phylogenetic footprinting is one of the widely accepted computational method for predicting cis-regulatory elements for a given genome in question (Hardison 2000). This method can be considered as a two step process. The first step involves, identification of orthologs in the reference genomes, for each protein of a target genome by bidirectional best hit method, prediction of transcriptional units of target and reference genomes, generation of cluster of transcriptional units (CoTs), and finally clustering of upstream DNA sequences based on the generated CoT data (Wels et al. 2006). The second step involves scanning for conserved DNA elements in the clustered-upstream DNA sequences of a given CoT. Various computational tools, such as MEME, Bioprospector, Gibbs sampler, MDScan are used for predicting conserved DNA elements in a given set of DNA sequences and can be represented in the form of consensus pattern (Bailey and Elkan 1994;Liu et al. 2001Liu et al. , 2002Mrazek 2009;Neuwald et al. 1995). There are many computational tools to perform the second step of phylogenetic foot printing but, they are not available to perform the first step. Further, the first step by itself is a multi-step process and requires lengthy computational procedure. On the other hand, the number of microbial genomes being sequenced is constantly increasing and demands for the development of an automated tool. Developing such a tool would facilitate the biologists to work easily on any microbial genome for quick generation of clustered-upstream DNA sequences for the target genome in question. Keeping the above facts in view, we developed an automated integrated pipeline called, UpCoT, which identifies the orthologs for proteins of target organism (tgCoGs), generates clusters of transcription units (tgCoTs), and cluster the upstream DNA sequences of tgCoTs. The output of the UpCoT can be directly used for prediction of cis-regulatory elements using any computational tool of user's choice.

Design of UpCoT interface
UpCoT web interface was designed using HTML, PHP and javascript to select and retrieve the genomes for analysis by UpCoT package. The *.faa, *.fna, and *.ptt files of 1840 prokaryotic genomes were downloaded from NCBI (ftp:// ftp.ncbi.nlm.nih.gov/genomes/Bacteria) and incorporated in the web server, where UpCoT package has been maintained. User can select any of these genomes as target and reference files as input to UpCoT package. UpCoT package along with selected genomes for analysis can be downloaded from the web link, http://jssplab.uohyd.ac.in/upcot. The UpCoT package was developed using Perl programming language and is compatible for Windows and Linux operating systems.

Description and accessibility of UpCoT web interface
The web interface contains 'Home', 'Help' and 'Contact' links below the header. A brief introduction about the UpCoT is given in the homepage. A single click on 'Select Genomes' link navigates to a new web page displaying the list of prokaryotic genomes for selection as target genome. User can also select either windows or linux operating system, for downloading suitable executables to run the UpCoT package. Upon selecting the operating system and the target organism, a new page appears and prompts the user for selecting the reference genomes. Here the user can select any number of reference genomes for analysis and then click on the submit button. After the selection of both target and reference genomes, user can download the UpCoT package by a single click on 'Download UpCoT' tab. The UpCoT windows version needs a supporting software package GNU on windows to be installed, which is provided along with UpCoT package. UpCoT package also contains provides the instructions about how to use UpCoT package. The file 'settings.txt' provides the input parameters, as mentioned in b. The 'upcot.pl' is the main file which invokes all the Perl programs that are present in ''bin'' directory. b Snapshot showing the file contents of 'settings.txt'. E value cutoff to be used for prediction of bidirectional best hits (d = default, 1 e -3 ). User may change this value before running UpCoT. Number of orthologs present in each tgCoG. When value is set to 4, all tgCoGs containing minimum 4 orthologs and above are selected for further analysis. Max_UP_length maximum length of the upstream region to be considered by UpCoT. Configuration of the computer (32-or 64-bit). Path of GNU on Windows installation directory. Min_UP_length minimum length of upstream region to be considered by UpCoT 'settings.txt', 'README.txt', 'target_genome.txt', and 'reference_genomes.txt' files ( Fig. 1a). User may change the parameters, such as E value (default, d = 1 e -3 ), orthologs count (default = 4), computer configuration (default = 64 bit), installation path of GNU on window (in case of windows user), minimum upstream length (min_UP length default = 50) and maximum upstream length (max_UP length default = 350 bp), in ''settings.txt'' file provided in the package (Fig. 1b). The ''bin'' directory provided in the UpCoT contains Perl programs developed for performing different tasks such as blastP, extraction of top scoring hits, prediction of bidirectional best hits, counting the number of orthologs, upstream sequence retrieval, and generating clusteredupstream sequences.

UpCoT output
UpCoT identifies the orthologs for each protein of target genome by bidirectional best hit method (BDBH) in the given reference genomes using BlastP (Altschul et al. 1990). After performing BDBH, UpCoT generates clusters of orthologs groups for target genome (tgCoGs) based on the orthologs count. For example, if the orthologs count is set to 4, tgCoGs containing four orthologs or above will be selected for further analysis. A directory named as ''tgCoG_protein_sequences'' is generated containing FASTA files of tgCoG protein sequences. In addition, UpCoT generates clusters of transcriptional units (tgCoTs) by reading the tgCoG files and also based on the length of their corresponding upstream DNA sequences. The minimum length of the upstream region (min_UP length default = 50) and the maximum length of the upstream region (max_UP length default = 350). It excludes the upstreams of the open reading frames, which are less than the defined nucleotide length. Reports suggest that the genes possessing an upstream region less than 40-50 bp are to be excluded from the computational prediction of cis-regulatory elements, we have set the minimum default integer value as 50 bp in the ''settings.txt'' file ( Fig. 1b) (Conlan et al. 2005;Liu et al. 2008;Salgado et al. 2000). User may change the minimum length as per the requirement. When the actual length of the upstream region is greater than the minimum default length or user-defined minimum length, the program selects 350 bp upstream region of an ORF. When the upstream intergenic region is longer than 350 bp, the UpCoT considers only 350 bp upstream region, as the max_UP length is set to 350 bp. User may also change the max_length according to the requirement. Subsequently, UpCoT extracts and clusters the upstream DNA sequences of tgCoTs based on default or user defined integer values as upstream length given in ''settings.txt'' file. Upon completion of the whole process, a directory named ''tu_upstreams'' appears in the working directory of the UpCoT. This directory contain multiple text files, each with clustered-upstream DNA sequences of a tgCoT. Each text file is named with ORF number of the target gene. User can submit these upstream sequences for Fig. 2 The schematic representation of the UpCoT input, UpCoT work flow and UpCoT output. The inputs for UpCoT are *.faa, *.fna, *.ptt files of target and reference genomes of user's choice. UpCoT uses these files to generate tgCoGs by Bidirectional best hit method (BDBH) and the clusters of transcriptional units (tgCoTs). UpCoT groups the upstreams of each gene of a tgCoT to generate clustered-DNA upstreams of that tgCoT. All clustered-DNA upstreams of each tgCoT are saved into 'tu-upstreams' directory. Each output file is a text file named with 'Up-ORF id' of the target organism. UpCoT also generates the tgCoG protein sequences as text files. G1 tgCoT of gene 1, P1 tgCoG of gene 1, Up-Gn clustered-upstream sequences of gene 'n' of a target genome any motif prediction tool for identifying cis-regulatory elements. The entire work flow of UpCoT including inputs, the processes, and the outputs are depicted in (Fig. 2).  1 e -3 ), ortholog count as 4 and min_UP length as 50 and max_UP length as 500 bp for testing UpCoT. From the output generated by UpCoT, the text files with names Slr2075, Slr0074 and Slr0898 were selected from 'tu_upstreams' directory and submitted for the prediction of cis-regulatory elements using stand alone versions of MEME, Gibbs Motif Sampler, MDScan and Bioprospector.

Performance analysis of UpCoT
The target genome, Synechocystis has 3172 open reading frames that code for proteins involved in various cellular processes and unknown proteins. Out of 3172 proteins, UpCoT has generated 2578 tgCoGs, each containing minimum four orthologs. This shows that 81 % of Synechocystis proteins are present in at least four selected cyanobacterial species. Orthologs identified by UpCoT for the selected proteins were retrieved from 'tgCoG_pro-tein_sequences' directory and tested for accuracy. Table 1 shows the selected proteins and their orthologs along with their functional annotation. From (Table 1), it is clear that the orthologs identified by UpCoT for the proteins Slr2075 (GroES), Slr0074 (Ycf24), Slr0898 (NirA), Ssl2598 (PsbH), Smr0009 (PsbN), Sll0851 (PsbC) andSll0894 (PsbD) are accurate because their annotations are same as given in NCBI genome database.

Analysis of clustered-upstream DNA sequences for selected tgCoTs
UpCoT has generated 2578 text files each containing clustered-upstream DNA sequences of a tgCoT. Out of which, the clustered upstream DNA sequences of slr2075, slr0074 and slr0898 were submitted for cis-regulatory element prediction, as the regulatory elements for these genes were previously experimentally demonstrated to be the target sites for known transcription factors. The clustered-upstreams were submitted to four different motif prediction tools as described in the materials and methods. (Table 2) shows the predicted cis-regulatory elements which were identified in the clustered-upstreams of the above selected tgCoTs. In Synechocystis the gene slr2075 encodes for co-chaperonin GroES. HrcA, a transcriptional repressor has been reported regulate the expression of the groESL operon by binding to a 9-bp inverted repeat TTAGCACTC [N9] GAGTGCTAA (Nakamoto et al. 2003;Zuber and Schumann 1994). When the clustered-upstream sequences of slr2075-tgCoT was submitted as input to motif prediction tools the same inverted repeat was predicted by MEME, Gibbs motif sampler and Bioprospector ( Table 2). The SufR is a negative transcriptional regulator of sufBCDS operon in Synechocystis. SufR binds to cis-acting element, CAAC-N6-GTTG located between the divergently transcribed sufR gene and the sufBCDS operon, and acts as a repressor of the sufBCDS operon and as an auto-regulator of its own gene, sufR (Wang et al. 2004). Motif prediction tools MEME, MDScan and Bioprospector generated the same element upon submission of clustered-upstream sequences of slr0074-tgCoT (Table 2). A number of nitrogen assimilation genes are regulated by the global transcriptional regulator NtcA, that acts as both an activator and repressor (Aichi et al. 2001). The binding site of NtcA is reported to be a tri-nucleotide inverted repeat GTA N(8) TAC. The ORF, slr0898 codes for Ferredoxin-nitrite reductase (NirA) in Synechocystis. Bioprospector tool has predicted the NtcA binding site in the clustered-upstream DNA sequences of tgCoT-slr0898 (Table 2). Thus, based on the identification of experimentally validated cis-regulatory elements for clustered upstreams oftgCoTs, we suggest that UpCoT is suitable for extracting and clustering of upstreams for any group of microbial genomes with accuracy and can be used for phylogenetic foot printing, promoter prediction, sRNA mapping and TSS prediction.

Conclusion
UpCoT is an automated software that can perform prediction of bidirectional best hits, clusters of transcriptional units (tgCoTs) and grouping of upstream DNA sequences for the predicted tgCoTs in a single step. It can be used as a tool by biologists to work on available microbial genomes Synechocystis was used as target and other selected cyanobacterial species were used as reference organisms. Functional annotation is given in parenthesis and is based on NCBI genome database (ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria) for prediction of cis-regulatory elements using phylogenetic foot printing. UpCoT can be downloaded from http:// jssplab.uohyd.ac.in/upcot/.