Background

Transcription factors (TFs), regulators of gene transcription, are critical for most if not all biological processes. They possess DNA-binding domains (DBDs) that recognize specific DNA sequences and mediate TF-DNA interactions. By binding to functional DNA elements, such as promoters, enhancers, silencers, and insulators, TFs can activate or suppress the transcription of their target genes and precisely control cell phenotypes and functions. The TF repertoires of many species throughout the tree of life are available in several TF databases, most of which identify TFs from genome sequences. For example, DBD (transcription factor prediction database) predicts TFs from 930 completely sequenced genomes in Archaea, Bacteria, and Eukaryota [1]. CIS-BP (Catalog of Inferred Sequence Binding Preferences), which covers 290 eukaryotic genomes, accrues DNA binding motifs based on experimental identification and computational inference [2]. Other TF databases that focus on specific taxonomic groups are also mainly based on genome sequences. For instance, AnimalTFDB records TFs from 65 animals, mostly vertebrates [3], while FlyFactorSurvey [4] and FactorBook [5] focus on the binding specificity of TFs of fruit fly, human, and mouse. However, as only a few crustacean genomes have been sequenced to date, among the TF databases mentioned above only CIS-BP covers two crustacean species, while DBD contains one. Consequently, critical knowledge concerning crustacean TFs, including their sequences, functions and evolution, has rarely been explored.

Crustaceans are a diverse group of animals of ecological and commercial importance all over the world, including many species of high economic value in fisheries and aquaculture, particularly decapods such as crabs (Brachyura), lobsters (Astacidea and Achelata), crayfishes (Astacidea) and shrimps (Caridea and Dendrobranchiata). The total production of crustacean farming increased to nearly 14 million tons in 2015, with an average annual growth rate of 3.75% in recent years [6]. About half of all crustacean products are produced by aquaculture, and the other half relies on capture fishery [6]. The proportion of crustaceans among all aquaculture animals has increased from less than 5% before 2000 to close to 10% in 2015, valued at over US$38 billion [7]. The escalating harvest of wild stocks of commercially important decapod crustaceans for fishery and aquaculture has also raised public concern about the sustainability and the detrimental environmental impact of such practices. A few studies have already documented the negative effects of crustacean fishery and aquaculture on the environment and crustacean biodiversity [8, 9]. Besides the economically important species, many small crustacean species are important components of various ecosystems. Many of them, for instance krill (Euphausiacea) and copepods (Copepoda), are a major food resource for marine fauna, serving as important trophic links between the primary producers and the macrofauna [10]. Others, such as amphipods (Amphipoda) and Daphnia (Cladocera), are often used as bioindicators to assess the impact of human activity and environmental changes on ecosystems and biodiversity [11,12,13]. Research on the growth, reproduction, and development of crustaceans is particularly crucial to fisheries, aquaculture, biodiversity conservation and environmental protection. Given the importance of TFs in these biological processes, investigations of crustacean TFs are urgently needed to improve crustacean fisheries and aquaculture and to mitigate biodiversity degradation and loss.

Our current knowledge of crustacean TFs is mainly derived from low-throughput experiments and is restricted to only a few TFs in a few species [14,15,16,17]. A system-wide exploration of the repertoire of crustacean TFs will shed light on the complexity and diversity of crustacean transcriptional regulatory systems. Fortunately, despite the lack of genome sequences for most crustacean species, the large number of crustacean transcriptomes that have emerged in recent years provides an opportunity to predict TFs from assembled transcripts on a transcriptome-wide scale. We collected publicly available crustacean transcriptomes, together with transcriptomes newly sequenced in our laboratory, and searched for TFs in all transcriptome assemblies. To disseminate our results, we constructed a database, CrusTF, in which coding sequences (CDSs), protein sequences, DBDs, DNA binding motifs and phylogeny of crustacean TFs can be freely accessed by researchers. Whereas current TF databases contain only a few crustaceans, CrusTF includes TFs from 170 crustacean species. It is the first TF database derived from transcriptome data. It will serve as a model for filling the knowledge gap in TF genes throughout the tree of life for species whose transcriptomes are more readily available than genomes.

Construction and content

CrusTF is a database of crustacean TFs mainly derived from de novo transcriptome assemblies. It draws on a comprehensive collection of crustacean transcriptomes to identify transcribed crustacean TFs. Users can search TFs with keywords matching TF names or TF identifiers, select a species and TF family of interest, or search with Blast tools using their own TF sequences. Free batch download is available for the TF CDSs, protein sequences and domain sequences of selected species or TF families. Besides TF sequences, CrusTF also contains functional and evolutionary information on crustacean TFs.

Database implementation

CrusTF is implemented on a Linux-Apache-MySQL-PHP (LAMP) stack. All data, including TF sequences, TF information, domains, species information, and motifs, are stored in a MySQL database. The website is built on CodeIgniter, a PHP framework that provides an Application Programming Interface (API) for connecting the website to the MySQL database. We also used JavaScript libraries, including jQuery (2.2.0), jQuery-labelauty and some additional plugins, to provide dynamic web services.

Data resources

Crustacean transcriptomes were downloaded from two public databases: the National Center for Biotechnology Information (NCBI) Transcriptome Shotgun Assembly (TSA) database (https://www.ncbi.nlm.nih.gov/genbank/tsa/) and the Sequence Read Archive (SRA) database (https://www.ncbi.nlm.nih.gov/sra). Detailed information on all transcriptomes from SRA and TSA is summarized in Additional file 1: Tables S2 and S3, respectively. As of January 2017, our transcriptome collection comprises 919 and 122 crustacean transcriptome samples curated from the SRA and TSA databases, respectively (Additional file 1: Tables S2 and S3). In addition, 37 RNA-seq samples from 31 crustacean species generated in our laboratory were included in the current version of CrusTF (Additional file 1: Table S1 and unpublished data of Ma et al.). Additional file 1: Table S1 lists the transcriptome data sources of each crustacean species available in CrusTF. Crustacean genes, including those derived from low-throughput experiments, were also downloaded from GenBank (https://www.ncbi.nlm.nih.gov/genbank/) (Additional file 1: Table S4).

Data processing

For transcriptomes in TSA, assembled contigs were downloaded directly from the database. For transcriptome data in SRA, raw RNA sequencing (RNA-seq) reads were downloaded. To standardize the data processing procedure, only RNA-seq data generated using Illumina sequencers were selected. The quality of raw reads was assessed with FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Adapters were trimmed with Trimmomatic [18], allowing two seed mismatches to tolerate sequencing errors in adapter sequences. Low-quality subsequences of reads (average quality score lower than 20) were removed, and trimmed reads shorter than 50 nucleotides were discarded. The processed reads were assembled de novo using Trinity version 2.4.0 with default settings to obtain contigs of transcript sequences [19]. Potential contaminant sequences from bacteria, viruses or archaea were filtered out with Kraken [20]. Contigs from multiple transcriptomes of the same species were clustered and further assembled with TGICL, using an identity cutoff of 0.94, to reduce redundancy [21]. CDSs and amino acid sequences were deduced from the assembled contigs, as well as from the GenBank sequences, with Transdecoder [22].
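The two read-level filters described above (removal of low-quality subsequences, then removal of reads shorter than 50 nucleotides) can be illustrated with a minimal sketch. This is not the Trimmomatic implementation: the 4-base window, the Phred+33 encoding and all function names are assumptions made for illustration only, whereas the quality cutoff of 20 and the minimum length of 50 come from the text.

```python
MIN_AVG_QUALITY = 20  # quality cutoff stated in the text
MIN_READ_LENGTH = 50  # minimum read length stated in the text
WINDOW = 4            # assumption: window size is not given in the text


def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def sliding_window_trim(seq, qual, window=WINDOW, min_avg=MIN_AVG_QUALITY):
    """Cut the read at the first window whose mean quality is below min_avg."""
    scores = phred_scores(qual)
    for start in range(0, max(len(scores) - window + 1, 0)):
        chunk = scores[start:start + window]
        if sum(chunk) / window < min_avg:
            return seq[:start], qual[:start]
    return seq, qual


def keep_read(seq, qual):
    """Return the trimmed read, or None if it is too short after trimming."""
    trimmed_seq, trimmed_qual = sliding_window_trim(seq, qual)
    if len(trimmed_seq) < MIN_READ_LENGTH:
        return None
    return trimmed_seq, trimmed_qual
```

A read whose quality collapses near its 3' end is shortened and kept only if at least 50 bases remain; a read that degrades early is dropped entirely.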

Functional annotation of TFs

DNA binding domains (DBDs) were identified by scanning all crustacean proteins against the hidden Markov models (HMMs) of known DBDs from the Pfam database [23] using PfamScan, which relies on HMMER3 [24]. Predicted TFs with DBDs were compared with the well-annotated TFs in CIS-BP, and each was named after its best-matching known TF.

The DNA binding motif of each TF was inferred as described previously [2]. In brief, the DBDs of crustacean TFs were compared with those of TFs with known binding motifs, and the similarity between crustacean TFs and known TFs was calculated. Based on the observed co-evolution of DBD sequences and their DNA binding motifs, it has been reported that the binding motif of a TF can be inferred from that of a TF with a homologous DBD when the identity between the two DBDs is greater than a threshold [2]. The thresholds for all TF families can be found in CIS-BP (file cisbp_1.02.tf_families.sql in the package from “Download MySQL Tables” in http://cisbp.ccbr.utoronto.ca/bulk.php). Known binding motifs of metazoan TFs were downloaded from CIS-BP [2], JASPAR [25], UniPROBE [26] and hPDI [27]. The DNA binding motif of a crustacean TF was inferred only when its DBD was highly similar to that of a TF whose DNA binding motif has been detected experimentally.
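The threshold rule described above can be sketched as follows. This is an illustration under stated assumptions, not the CIS-BP procedure itself: identity is computed here as the fraction of identical residues over the ungapped columns of DBDs taken from one common alignment, and the function names, motif identifiers and threshold values are hypothetical. The real family-specific thresholds are those distributed with CIS-BP.

```python
def percent_identity(aligned_a, aligned_b):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        matches += (a == b)
    return matches / compared if compared else 0.0


def infer_motif(query_dbd, known_tfs, threshold):
    """Return (motif_id, identity) for the most similar known DBD whose identity
    is greater than the family threshold, or None if no DBD qualifies.

    known_tfs is a list of (motif_id, aligned_dbd) pairs (hypothetical layout).
    """
    best = None
    for motif_id, known_dbd in known_tfs:
        identity = percent_identity(query_dbd, known_dbd)
        if identity > threshold and (best is None or identity > best[1]):
            best = (motif_id, identity)
    return best
```

For example, with a hypothetical family threshold of 0.7, a crustacean DBD that is 90% identical to a DBD with an experimentally derived motif would inherit that motif, while one that is only 50% identical would be left without an inferred motif.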

The confidence level of a predicted TF was estimated using several criteria: 1) the percentage of the top hit that matches the predicted TF, together with the E-value, Blast score and sequence identity from a Blast search against the SwissProt protein database; 2) the bit-score and E-value from PfamScan; 3) the number of homologs found in other crustacean species; and 4) the number of transcriptome samples in which the TF was detected. Within each family, crustacean TFs were ranked separately by Blast E-value, PfamScan E-value, number of crustacean homologs and number of supporting samples. The four ranks of each TF indicate the confidence of the prediction.
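The per-family ranking by four criteria can be sketched as below. The record layout, field names and tie handling (ties keep input order) are assumptions made for illustration; the text does not specify how the database treats ties or whether the four ranks are combined.

```python
# Each criterion maps to True when larger values indicate stronger support.
CRITERIA = {
    "blast_evalue": False,  # smaller E-value is better
    "pfam_evalue": False,   # smaller E-value is better
    "homologs": True,       # more crustacean homologs is better
    "samples": True,        # more supporting samples is better
}


def rank_by(tfs, key, larger_is_better):
    """Map TF id -> rank (1 = strongest support) for one criterion."""
    ordered = sorted(tfs, key=lambda tf: tf[key], reverse=larger_is_better)
    return {tf["id"]: position for position, tf in enumerate(ordered, start=1)}


def confidence_ranks(tfs):
    """Return TF id -> tuple of four ranks, in the order listed in CRITERIA."""
    per_criterion = {
        key: rank_by(tfs, key, larger_is_better)
        for key, larger_is_better in CRITERIA.items()
    }
    return {
        tf["id"]: tuple(per_criterion[key][tf["id"]] for key in CRITERIA)
        for tf in tfs
    }
```

Keeping the four ranks separate, rather than collapsing them into one score, lets a user see which kind of evidence supports a given prediction.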

Evolutionary analysis

The DBD sequences of the crustacean TFs and the TFs of 117 other animals in each TF family were aligned with Clustal Omega, a fast and scalable tool for multiple amino acid sequence alignment [28]. An approximately-maximum-likelihood phylogenetic tree for each TF family was reconstructed with FastTree under the JTT + CAT model [29]. Trees were visualized with the R package ggtree [30] and iTOL (Interactive Tree Of Life) [31]. Users can manipulate the trees interactively via iTOL.
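The alignment and tree-building steps for one TF family could be scripted roughly as follows. The exact options used for CrusTF are not stated in the text, so this sketch assumes default settings, that the executables are installed under the names `clustalo` and `FastTree`, and hypothetical file names. FastTree applies the JTT + CAT model by default to protein alignments and writes the Newick tree to standard output.

```python
import subprocess


def clustalo_command(dbd_fasta, alignment_fasta):
    """Command line for aligning the DBD sequences of one family with Clustal Omega."""
    return ["clustalo", "-i", dbd_fasta, "-o", alignment_fasta,
            "--outfmt=fasta", "--force"]


def fasttree_command(alignment_fasta):
    """Command line for FastTree on a protein alignment (default model: JTT + CAT)."""
    return ["FastTree", alignment_fasta]


def build_family_tree(dbd_fasta, alignment_fasta, tree_newick):
    """Align the DBDs, then infer a tree and save it in Newick format."""
    subprocess.run(clustalo_command(dbd_fasta, alignment_fasta), check=True)
    with open(tree_newick, "w") as handle:
        # FastTree prints the tree to stdout, so redirect it into the file.
        subprocess.run(fasttree_command(alignment_fasta), stdout=handle, check=True)
```

Calling `build_family_tree("homeobox_dbd.fa", "homeobox_dbd.aln.fa", "homeobox_dbd.nwk")` (hypothetical file names) would produce a Newick file that can be loaded into ggtree or uploaded to iTOL.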

Utility and discussion

The crustacean TFs in CrusTF are classified by species and by functional domain type. Users can choose a species of interest from the species list to browse all crustacean TFs identified from its transcriptomes. In the current version, CrusTF covers a total of 170 crustacean species (Table 1), of which only two (Daphnia pulex and Artemia franciscana) are included in other, genome-based TF databases (DBD and CIS-BP). Our collection covers 15 crustacean orders (Fig. 1a). The two major orders are Decapoda and Amphipoda, which have 68 and 70 species in our collection, respectively. The former contains many economically important crustacean species, such as crabs, lobsters, crayfishes, and shrimps (http://www.fao.org/fishery/collection/asfis/en), while the latter contains many species used for environmental monitoring [32]. Even though the other orders have only 1 to 5 species each, these are usually the most representative species of those orders. The high coverage of species in this subphylum allows direct comparison of TF sequences among different crustaceans. CrusTF therefore provides, for each TF, a function for searching similar TFs in other species. Homolog searches against crustacean TFs from transcriptomes and GenBank, and against TFs from the genomes of other animals, can be run by clicking the “Search” buttons on the main page of each TF. Users can thus investigate how homologous TFs vary among species.

Table 1 Crustacean species available in CrusTF
Fig. 1

Statistics of CrusTF. a Number of species belonging to 15 orders of Crustacea. b Increase in the number of crustacean species for which transcriptomes or genomes have been published. All four databases are hosted by the National Center for Biotechnology Information (NCBI). SRA Transcriptome: Transcriptomes (RNA-seq) in the Sequence Read Archive; TSA: Transcriptome Shotgun Assembly database; NCBI Genome: NCBI genome database; WGS: Whole Genome Shotgun database. c Number of TFs identified in each species

Compared with the number of available crustacean genomes, the number of crustacean species with transcriptome data has grown much faster in recent years (Fig. 1b). Thus, identifying TFs from de novo assembled transcriptomes is a much more efficient approach for species without available genomes and an important supplement to current methods based on whole genome sequences. However, as shown in Table 1, because data volumes differ among species, transcriptome coverage may also differ considerably. The number of total unique contigs varies from 1102 to 641,047, owing to variation in library preparation methods, sequencing throughput and the number of samples. As a result, the number of TFs predicted per species ranges from 6 to 10,535 (Fig. 1c). This range reflects two limitations of the transcriptomic approach. First, transcriptomes are usually incomplete when the sequencing throughput is low or the RNA samples do not cover a wide range of tissues and conditions. A gene that is not expressed in the sampled conditions will not be sequenced, so such species have fewer predicted TFs. Second, de novo transcriptome assembly may generate many false transcripts, which could be filtered out by comparison with known genes or protein features. We did not filter them out, because filtering assembled transcripts based on current knowledge may lead to the loss of novel TFs that may be very important. Instead, we keep all predicted TFs and provide the evidence that supports them (see Construction and content), from which users can estimate the reliability of each predicted TF and also explore new TFs for further investigation. Despite its limitations, our transcriptomic approach is still an efficient and valuable way to predict TFs, especially when the genomes of many species are not available.

In the current version, CrusTF contains 131,941 and 8502 TFs of crustacean species from transcriptomes and GenBank, respectively. They are classified into 63 TF families according to the DBDs or DBD combinations detected in their sequences (Fig. 2). Detailed information on the TF families is listed in Additional file 1: Table S5. Many TF families that are prevalent in metazoans were detected in most crustaceans. Interestingly, crustaceans have distinct patterns of TF family composition compared with other animals (Fig. 2). Several TF families, such as those with the Zinc finger CCCH domain and the BED zinc finger, show extensive expansion in crustaceans. Some TF families with distinct DBD combinations may represent putative TFs that are unique to this animal group and have not been characterized in other TF databases (Fig. 2). Users can browse the TFs of a given TF family by selecting it from the TF family list. They can also download all TF sequences of a family, from a species of interest or from all crustacean species, on the “Download” page.
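The family composition compared across species, that is, the percentage of TFs in each family over all predicted TFs in a species, can be computed with a few lines. This is a minimal sketch; the function name and the input layout (one family label per predicted TF) are assumptions for illustration, and the family labels in the usage example are placeholders.

```python
from collections import Counter


def family_percentages(tf_families):
    """Percentage of a species' predicted TFs that fall in each TF family.

    tf_families holds one family label per predicted TF of the species.
    """
    counts = Counter(tf_families)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {family: 100.0 * n / total for family, n in counts.items()}
```

For a species with four predicted TFs labelled `["Homeobox", "Homeobox", "bHLH", "CCCH ZF"]`, the Homeobox family accounts for 50% and each of the other two for 25%. Applying the function to every species gives one row per species of a species-by-family matrix.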

Fig. 2

TF families in crustaceans compared to those in other animals. Colors show the percentage of TFs in each TF family over all predicted TFs in a species (white, <1%; yellow to green, 1–100%). Each row is a species and each column is a TF family. The side bar highlights the taxa. Many TF families on the left, which are prevalent in metazoans, were detected in most crustaceans. Several TF families, such as those with the Zinc finger CCCH domain (CCCH ZF) and the BED zinc finger (BED ZF), show extensive expansion in crustaceans. Some TF families with distinct DBD combinations may represent putative TFs that are unique to this animal group and have not been characterized in other TF databases

To help users understand the phylogenetic relationships among crustacean DBDs and DBDs from other animals, the phylogenetic tree of each DBD type is available on the “Trees” page. These trees visualize the phylogenetic relationships among DBDs from our 170 crustaceans and 117 other animals of 9 phyla, from Porifera to Chordata. Based on the phylogenetic relationships of DBDs, CrusTF infers the DNA binding motif of each TF from its closest TFs with experimentally derived motifs. Users can browse the motif information on the web page of each TF and download the motifs for further prediction of downstream targets.

Conclusion

In summary, CrusTF is the first TF database derived from transcriptome data. It uncovers the specific pattern of the transcriptional regulatory system of crustaceans and the diversity of TFs in this important group of animals. This database will constitute a key resource for the research communities of crustacean biology and evolutionary biology. Given the importance of TF information in functional studies of crustacean transcriptional regulatory systems, it will facilitate research on the growth, reproduction, and development of crustaceans, and subsequently benefit studies on crustacean fisheries, aquaculture, biodiversity conservation and environmental protection.